Abstract: As an alternative to using a bidelta network topology for large Transit
networks, I consider the requirements to extend the base Transit network into Leiserson's
Fat-Tree configuration. Transit will be a high-speed, low-latency, fault-tolerant network
interconnection for high performance multi-processor computers. The initial interconnect
scheme planned for Transit will use a bidelta style network to support up to 256 processors.
Scaling beyond 256 processors by simply extending that network topology will result in a
uniform degradation of network latency across processors. A fat-tree network structure will
allow the Transit network to be scaled arbitrarily while taking advantage of the locality
and universality of fat-trees to minimize the impact of scaling upon network latency. I
consider the topology and construction issues for integrating the Transit routing network
component and technology into a fat-tree configuration. I also characterize the resulting
network's size, locality, and performance and compare these characteristics with those of
bidelta networks.

1. Introduction



1.1 Overview

Transit [Knight 89] is a high-performance network for large scale MIMD computers.
Transit is intended to provide the underlying network support for a wide range of parallel
processing paradigms including message passing, shared memory, and dataflow. Transit
provides low-latency interconnect between processors via an indirect circuit switched
network. The network is constructed as a bidelta multi-stage shuffle exchange network
[Kruskal 86] utilizing 4x4 crossbar routing elements. The Transit bidelta network allows
connections between source and destination to be made through a few routing stages for
moderate size networks. In order to provide fault tolerance and improve network routing
efficiency, redundant paths are provided through the network. Routing switches are kept
simple, and thus fast, by implementing a source responsible connection protocol; this frees
the individual routing elements from the complexity of collision avoidance.

For moderate sized networks (i.e. 64 to 256 processors), the bidelta network construction
keeps the delay uniformly small between all processors in the network. Additionally,
using three-dimensional wiring strategies, the size and wire length can be kept reasonably
small. This allows the network to be constructed in a single physical package within current
technology limitations.

Scaling to larger network sizes by simply extending this strategy leads to a few difficulties.
Network delays become uniformly large. The network itself becomes large enough that
it must be divided into multiple parts for packaging. In addition, wire lengths grow such
that reducing the clock rate of the system, due to wire delays, becomes a serious concern.

To avoid this uniform performance degradation, I propose a scheme for constructing
larger networks using Leiserson's volume-universal fat-tree network [Leiserson 85]. In this
manner, locality can be exploited to provide short interconnect between closely situated
processors; at the same time, interconnection between more distant processors is still
possible without significant performance reduction from that of the bidelta network. The
fat-tree network also allows the network to be decomposed into constituent elements for
packaging in a straightforward manner.

The extension to a fat-tree network scheme primarily impacts interconnection topology
and routing schemes. I describe how the fat-tree routing network can be realized utilizing
the existing Transit routing element and packaging technology. This extension can be
implemented with minor revisions to the Transit routing component.

In the remainder of this chapter, I briefly review the relevant properties of Transit
(Sections 1.2 through 1.5) and fat-trees (Section 1.6). In Chapter 2, I develop some of the
topology issues for building the fat-tree network structure. Chapter 3 follows, providing
concrete possibilities for the realization of the fat-tree network. Chapter 4 then proceeds to
quantify some of the properties of the network and compares it with Transit bidelta networks
of comparable size. Finally, Chapter 5 closes with conclusions and future directions.

Figure 1.1: Network Processor Interface

1.2 Processor Network Interface Model

The development provided throughout this paper is concerned entirely with the
construction of interconnection networks. In order for the network to be useful in the context
of a large scale parallel computer, it must interface coherently with its set of processors.
The network processor interface is shown in Figure 1.1. Each network endpoint is a
processor with its own local memory and a cache-controller for maintaining its local cache
and keeping the cache coherent with the rest of the network. Each processor has a pair of
inputs and a pair of outputs to the network. The Transit design intends the cache-controller to
serve as the processor's interface to the network so that the processor need not explicitly
deal with network interactions; however, this is a separate architectural issue and need not
necessarily be the case.

The network input and output connections are paired for fault tolerance and improved
routing success probabilities. Fault tolerance issues are discussed further in Section 1.5.
The processor will generally only use one of its two inputs to the network, guaranteeing
that the network will never be loaded above 50%.

1.3 Routing Component

Basic switching is provided by the routing element, RN1. This is a custom CMOS
routing component designed to provide simple high speed switching. RN1 has eight nine-bit
wide input channels and eight nine-bit wide output channels. This provides byte wide data
transfer with the ninth bit serving as a signal for the beginning and end of transmissions.

Figure 1.2: RN1 Logical Configurations



RN1 can be configured in one of two ways, as shown in Figure 1.2. RN1's primary
configuration is as a 4x4 crossbar router with 2 equivalent outputs in each logical direction.
In this configuration, all 8 input channels are logically equivalent. Alternately, RN1 can be
configured as a pair of 4x4 crossbars, each with 4 logically equivalent inputs and a single
output in each logical direction.

Simple routing is performed by using the first two bits of a transmission to indicate
the desired output destination. If an output in the desired direction is available, the data
transmission is routed to one such output. Otherwise, the data is ignored. In either case,
when the transmission completes, RN1 informs the sender of the connection status so that
the sender will know whether or not it is necessary to retry the transmission.

To allow rapid responses to network requests, RN1 allows connections opened over the
network to be turned around; that is, the direction of the connection can be reversed,
allowing data to flow back from the destination to the source processor. The ability to turn
a network connection around allows a processor requesting data to get its response quickly
without requiring the processor it is communicating with to open a separate connection
through the network.

Since RN1 always looks at the most significant two bits of the first byte of a transmission
to determine switching direction, it is necessary for the bits within this byte to be rotated
between network stages. This rotation guarantees each network stage a fresh set of bits
from the routing byte. The rotation property must be provided by the network wiring
scheme in which RN1 is used.
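
The interaction between the two-bit direction field and the inter-stage rotation can be
sketched in a few lines of Python. This is only an illustration: the function names are
hypothetical, and the two-bit left rotation is my assumption about how the wiring presents
"fresh" bits to each stage, since the text says only that the byte is rotated.

```python
def direction(routing_byte: int) -> int:
    """Return the logical direction: the two most significant bits of the byte."""
    return (routing_byte >> 6) & 0b11


def rotate(routing_byte: int, bits: int = 2) -> int:
    """Rotate an 8-bit value left; models the inter-stage wiring (assumed: 2 bits)."""
    return ((routing_byte << bits) | (routing_byte >> (8 - bits))) & 0xFF


def directions_seen(routing_byte: int, stages: int) -> list[int]:
    """List the direction each successive stage reads from one routing byte."""
    seen = []
    for _ in range(stages):
        seen.append(direction(routing_byte))
        routing_byte = rotate(routing_byte)
    return seen
```

Under these assumptions a single routing byte carries four direction fields, which is why
one byte is enough for four stages of four-way routing (256 destinations).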

When configured as a 4x4 crossbar with two outputs in each logical direction, RN1
provides redundant paths through the network since it provides multiple outputs in each
logical direction. This serves to increase both fault tolerance and the probability of routing
success. When both logically equivalent outputs are available, RN1 randomly selects one of
the two for use. In this manner, networks built from RN1 will have a level of fault tolerance
in that multiple attempts to send a transmission through the network will most likely take
separate paths through the network. The random selection of the preferred output port
in each direction gives transmissions a good chance of avoiding any faulty component(s)
entirely in successive connection attempts.

The alternative configuration for RN1 as a pair of 4x4 crossbars is provided for further
fault tolerance. As Section 1.2 described, there are two outputs from the network for
each processor. Using only the standard RN1 configuration, these two outputs would
have to come from a single routing component, making that routing component critical to
the proper functioning of the network [Leighton 89-2]. With this alternate configuration,
the two outputs from the network for each processor can come from two different RN1
components so neither component is critical. Section 1.5 explains this fault tolerance issue
in the context of the Transit bidelta networks.

Figure 1.3: RN1 Package (Diagram courtesy of Fred Drenckhahn)

The Transit routing component is described further in [Knight 89] and [Minsky 90].

1.3.1 Physical Description

RN1 will be packaged as a pad grid array with all signals appearing on pads on both
sides of the package. Figure 1.3 shows a top and side view of the RN1 package. A total of
76 periphery pads will provide through routing conduits for signals. Holes in the package
are provided for packaging alignment and coolant flow. Since RN1 is designed to drive
144 signal pins simultaneously at 100MHz, provisions for liquid cooling are essential. The
target physical statistics for RN1 are summarized in Table 1.1.

Table 1.1: RN1 Physical Statistics

1.3.2 Routing to more than 256 destinations

Using the first byte to specify routing destinations as described above becomes limiting
when attempting to address a large number of destinations. A single byte can only
distinguish 256 distinct destinations.

To address this potential problem, RN1 has a provision for dropping the leading byte
of a stream of data before looking at it. It then uses the remainder of the stream as if the
first byte never existed. Thus, RN1 starts using the new routing byte, which was originally
the second byte in the message, at the stage where the first was dropped. By properly
designating the stages in the network at which components should drop, or swallow, the
first byte in this manner, we can specify arbitrarily many distinct destinations.

This swallow property of a network routing stage is a static property. In the simple case
of bidelta networks, the swallow property can be set on a per chip basis, since all inputs
to a network stage will need fresh routing bytes at the same time. For full generality in
the networks considered here, it is beneficial to be able to configure this property on a per
input basis. This allows a single routing component to have inputs that connect to paths
of varying lengths.
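
As a rough illustration of the swallow mechanism, the following Python sketch walks a
message through a sequence of stages, each with a static swallow flag. The function name,
the list-of-bytes message model, and the two-bit rotation per stage are assumptions made
for the example; they are not taken from the RN1 specification.

```python
def route(message, swallow_flags):
    """Return (directions taken, message as it leaves the last stage).

    message       -- list of byte values; the first byte is the routing byte
    swallow_flags -- one boolean per stage; True means that stage drops the
                     leading byte before examining the stream
    """
    data = list(message)
    taken = []
    for swallow in swallow_flags:
        if swallow:
            data.pop(0)                      # discard the used-up routing byte unread
        head = data[0]
        taken.append((head >> 6) & 0b11)     # top two bits select the direction
        # Assumed wiring: rotate the routing byte left by two bits between stages.
        data[0] = ((head << 2) | (head >> 6)) & 0xFF
    return taken, data
```

For example, `route([0xE4, 0x1B, 0x55], [False] * 4 + [True, False])` models a six-stage
path in which the fifth stage swallows: the first four stages consume the four fields of
`0xE4`, and the last two read fresh fields from `0x1B`.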

1.4 Construction Technology

The basic unit of network packaging for Transit is the stack. A stack is a three-dimensional
interconnect structure constructed by sandwiching layers of RN1 routing components
between horizontal pc-board layers. The pc-boards perform inter-stage wiring and
the bit rotations described in the previous section while the routing stages provide
switching. Figure 1.4 shows a partial cross-section of a stack. The dominant direction of signal
flow is vertical as connections are made vertically through the stack. At each horizontal
routing layer, each path through the network will make a connection through a single
routing component. Between routing layers, the connection is routed horizontally to the
appropriate routing component in the next layer.

When the transmission reaches the top of the routing stack it is brought straight down,
back through the stack, to connect to the destination processors. This is necessary because
the set of source and destination processors will normally be the same. All routing through
the layers of routing components is provided by the through routing pins on the RN1
package as described in the previous section.

Figure 1.4: Cross-Section of Bidelta Routing Stack (Diagram courtesy of Fred Drenckhahn)



Contact is made between the routing components and the horizontal pc-boards through
button board carriers. These carriers are thin boards, roughly the same size as the routing
chip, with button balls [Smolley 85] aligned to each pad on the routing chip. These button
balls are 25 micron spun wire compressed into 20 mil diameter by 40 mil high holes in the
button board connector. They provide multiple points of contact between each routing
component and horizontal board when the stack is compressed together; in this manner
they effect good electrical contact without the need for solder. This allows ease of packaging
construction and component replacement.

Channels are provided both in the stack and through each routing component for liquid
cooling. FC-77 Fluorinert will be pumped through these channels to provide efficient heat
removal during operation.

At the targeted clock rate of 100MHz for network operation, wire delay consumes a
significant portion of the clock cycle. Thus, the physical size of the horizontal routing
boards is an important consideration for network performance. Additionally, with current
technology for fabricating pc-boards, it is not possible to fabricate pc-boards any larger than
2' x 2' with reliable yield.

Using 12 layer pc-boards for the horizontal routing, a stack is expected to have a side
length of roughly: 2 x (chip side length) x (number of chips across side). Each layer,
including pc-boards, routing components, and connectors, will be about 0.2" high. The height
of a stack will be roughly: 0.2" x (number of routing layers).
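
The two rules of thumb for stack dimensions translate directly into a small calculation.
The sketch below is illustrative only: the function names are hypothetical, the per-layer
thickness is left as a parameter whose 0.2 inch default is my reading of the figure in the
text, and the chip side length used in the example is a made-up value (the real one
belongs to Table 1.1).

```python
MAX_BOARD_SIDE_INCHES = 24.0  # the 2' x 2' pc-board fabrication limit


def stack_side(chip_side: float, chips_across: int) -> float:
    """Approximate stack side length: 2 x chip side length x chips across a side."""
    return 2 * chip_side * chips_across


def stack_height(routing_layers: int, layer_thickness: float = 0.2) -> float:
    """Approximate stack height: per-layer thickness x number of routing layers."""
    return layer_thickness * routing_layers


def fits_on_board(chip_side: float, chips_across: int) -> bool:
    """True when the estimated side length stays within the board size limit."""
    return stack_side(chip_side, chips_across) <= MAX_BOARD_SIDE_INCHES
```

With a hypothetical 1.5 inch chip, eight chips across a side gives a 24 inch board, which
is right at the fabrication limit; nine chips across would exceed it.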

A more detailed description of Transit packaging technology is given in [Knight 89].

1.5 Fault Tolerance in Bidelta Routing Stacks

The network interface model described in Section 1.2 along with the properties designed
into RN1 allow a reasonable level of fault tolerance to be built into Transit bidelta networks.

Two inputs are provided from the processor or cache-controller to the network. The
processor is expected to utilize only one of these two inputs at a given point in time. Leaving
the second input unused guarantees that the network will never be congested above the 50%
level (Section 1.2). Allowing the processor to choose which of the two inputs to the network
to actually use at the beginning of each network transaction prevents a single link between
the processor and network from being critical to the processor's ability to communicate
over the network. When two inputs are provided to the network through separate routing
components, we guarantee that no single component failure will isolate a set of processors
from the network.

Within the network, the standard RN1 configuration provides multiple outputs in each
logical direction. This allows the number of different paths through the network to expand.
No single routing component in the interior of the bidelta network is critical since there
is a complete path from each source to each destination which avoids any given routing
component.

The final routing stage of the bidelta network is constructed with the RN1 routing
components configured in the alternative manner in which RN1 acts as a pair of single
output 4x4 crossbars. In this way, the final switching stage for the pair of outputs
destined to the same processor can be distributed across two different routing components.
Since any connection to a given processor could be routed through either output, and hence
either of the two routing components in this final stage, neither of the routing components
is critical.

A more detailed description of this scheme for achieving fault tolerance in multistage
networks is given in [DeHon 90].

1.6 Fat-Trees

Fat-Trees are universal routing networks for interconnecting processors in a multiprocessor
environment [Leiserson 85]. Fat-Trees interconnect processors using a complete tree
structure. The processors are located at the leaves of the tree while the tree's internal nodes
compose the interconnection network. Fat-Trees parameterize the bandwidth between
internal nodes according to their distance from the root node. In general, nodes closer to
the root have greater bandwidth to accommodate additional message traffic. Connections
between processors within a sub-tree can be made without consuming bandwidth higher
in the tree. This allows locality to be exploited while reserving critical "long distance"
bandwidth for connections between widely separated processors. A fat-tree achieves these
properties with only a linear increase in the number of routing stages that must be traversed
in the worst case over a comparably sized bidelta network.

Figure 1.5: Area-Universal Fat-Tree with Constant Size Switches (Greenberg and Leiserson)

Fat-Trees are of particular interest because they can be area- or volume-universal when
the bandwidth growth toward the root is selected properly. A fat-tree is volume-universal
if it can simulate any other network constructed from the same amount of hardware with at
most a polylogarithmic slowdown. This property guarantees that the hardware dedicated
to constructing the routing network can be utilized efficiently.

In [Greenberg 85], Greenberg and Leiserson propose the area-universal fat-tree shown
in Figure 1.5. The fat-tree shown demonstrates a basic construction structure for
fat-trees using constant size switching elements. The internal node capacity doubles on each
successive level toward the root. This average capacity growth is sufficient to guarantee
area-universality for a quaternary fat-tree.

Fat-Tree networks were developed by Charles Leiserson. He provides a detailed
description in [Leiserson 85] as well as proofs for the universality properties of fat-trees.
In [Greenberg 85], Leiserson and Greenberg expand the generality of fat-trees by proving that
the universality properties hold for on-line routing algorithms.



2. Basic Configuration

2.1 Hybrid Fat-Tree Approach

The fat-tree network structure will be used to interconnect small Transit bidelta routing
networks. That is, the leaf nodes of the fat-tree will themselves be bidelta networks rather
than individual processors. This allows a moderate number of processors to be clustered
together and share uniformly short interconnection paths amongst themselves. All these
bidelta clusters are then interconnected through the fat-tree network, allowing locality to
be exploited between adjacent clusters while making it possible for any pair of processors to
communicate in a moderately efficient manner.

The terminal clusters are built in stacks much like the standard Transit bidelta network
stacks. The only difference is that a bidelta cluster stack must have some bandwidth
between itself and the fat-tree network rather than dedicating all bandwidth in and out
of the stack to processors. This can be done in a straightforward manner as shown
diagrammatically in Figure 2.1. The processors connect to the bidelta network, but instead of
consuming all of the bandwidth into the first stage of the bidelta network, they consume
only three-fourths of the bandwidth. The first routing stage routes in four logical directions
as before. However, only three of these directions route to further switching stages in the
bidelta cluster. One of the logical destinations out of the first routing stage connects directly
to the fat-tree network. Thus, the remaining routing stages will only be three-quarters the
size of the first routing stage since one-quarter of the interconnect bandwidth leaves the
stack after the first routing stage.

Figure 2.1: Bidelta Cluster at Leaves of Fat-Tree
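
The three-fourths split fixes the size of a cluster's first routing stage once the number
of processors is chosen. A minimal sketch of that arithmetic follows, assuming two network
inputs per processor (Section 1.2) and eight inputs per routing component (Section 1.3);
the function name and the returned field names are hypothetical.

```python
def first_stage(processors: int, inputs_per_processor: int = 2,
                chip_inputs: int = 8) -> dict:
    """Size the first routing stage of a bidelta leaf cluster.

    Processor inputs account for three-fourths of the first-stage inputs;
    the remaining quarter carries connections arriving from the fat-tree.
    """
    processor_inputs = processors * inputs_per_processor
    total_inputs, remainder = divmod(processor_inputs * 4, 3)
    if remainder or total_inputs % chip_inputs:
        raise ValueError("cluster size does not fill whole routing components")
    return {
        "total_inputs": total_inputs,
        "fat_tree_inputs": total_inputs - processor_inputs,
        "chips": total_inputs // chip_inputs,
    }
```

Under these assumptions, a 48-processor cluster needs a 128-input first stage built from
16 routing components, 32 of whose inputs come from the fat-tree.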



2.1.1 Allocation of Bandwidth to and from Fat-Tree

In the manner described, the fat-tree network consumes one-quarter of the total
bandwidth into and out of the bidelta routing stack. Clearly, the one-quarter of the bandwidth
from the first stage of routing to the fat-tree network must come from all the routing
components in the first stage. Naively, it would be possible to connect the bandwidth back from
the fat-tree network to the bidelta cluster to any of the inputs of the first stage routing
components as all these inputs are logically equivalent; however, it turns out that many
possible configurations prove not to be optimal.

One extreme would be to connect all the input to the bidelta cluster from the fat-tree
network to all of the inputs of one-quarter of the routing elements in the first stage. This
would be non-optimal in terms of fault tolerance because an unfortunately placed single
chip failure in this first routing stage could eliminate 8 of the inputs from the fat-tree
network. In terms of routing, this would be wasteful of bandwidth to the fat-tree network.
The outputs from the first routing stage to the fat-tree network that come from the set of
routing components with all their inputs originating from the fat-tree network would be
unused. This results because there is no reason to route a connection through a leaf node.
Clearly, this extreme is undesirable.

The alternate extreme is to distribute the bandwidth from the fat-tree network to the
cluster over all the routing components in the first routing stage. In this scheme, each
routing chip in the first stage has 2 of its 8 inputs connected to the fat-tree network. A
single chip failure will always reduce the bandwidth from the fat-tree network to the bidelta
cluster by two. Through each routing component at most six of the inputs will need to
be switched in the direction of the fat-tree network. Given even connection frequency
distributions across all processors of requests requiring the fat-tree network, each pair of
outputs to the fat-tree will be equally loaded.

A brief consideration of alternatives between these two extremes makes it clear that
the latter extreme is the most equitable distribution. Any configuration between these
two extremes would make it more likely for some processors to be able to make a
non-local connection than others. Similarly, any scheme other than the latter extreme would
necessarily lose more bandwidth due to a worst-case single chip failure. Thus, the most
equitable configuration is the one in which all the bandwidth between the fat-tree network
and the bidelta cluster is evenly distributed over all the routing components composing the
first stage of routing in the bidelta cluster.
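
The comparison between the two extremes can be made concrete with a short sketch. It
assumes eight inputs per first-stage routing component with one-quarter of all first-stage
inputs arriving from the fat-tree; the scheme names and function names are hypothetical
labels for the two allocations discussed above.

```python
def fat_tree_inputs_per_chip(num_chips: int, scheme: str) -> list[int]:
    """Number of fat-tree inputs wired to each first-stage routing component."""
    total = num_chips * 8 // 4              # one-quarter of all first-stage inputs
    if scheme == "concentrated":            # fill one-quarter of the chips completely
        full_chips = total // 8
        return [8] * full_chips + [0] * (num_chips - full_chips)
    if scheme == "distributed":             # 2 of the 8 inputs on every chip
        return [2] * num_chips
    raise ValueError(f"unknown scheme: {scheme}")


def worst_case_loss(allocation: list[int]) -> int:
    """Fat-tree inputs lost when the most heavily loaded single chip fails."""
    return max(allocation)
```

For a 16-chip first stage, both schemes carry 32 inputs from the fat-tree, but the
worst-case single chip failure costs 8 inputs in the concentrated scheme and only 2 in
the distributed one.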

2.1.2 Fault Tolerance

The bidelta leaf cluster is configured for fault tolerance in the same manner as Transit
bidelta networks (Section 1.5). The pair of inputs from each processor should be
distributed to different routing components. All but the final stage of routing is provided by
RN1 components in the standard 4x4 crossbar configuration with 2 outputs in each logical
direction. The final output stage is constructed with the RN1 components configured in
the alternate manner such that each processor receives its pair of network outputs from
two distinct routing components.

Figure 2.2: Logical Topology of Quaternary Tree

2.2 Fat-Tree Topology

2.2.1 Quaternary Tree

The appropriate fat-tree to construct using RN1 is logically a quaternary fat-tree
(Figure 2.2) as opposed to the binary fat-trees prevalent in the literature. This is the natural
selection since our routing component is a 4x4 crossbar switch. Using fewer than four
directions would require overspecifying every routing path through the network since two logical
output ports from a routing chip would actually be going in the same logical direction.
Similarly, constructing a tree with a branching factor greater than four would require
multiple stages of routing components to construct each virtual routing stage. The quaternary
tree, of course, has fewer levels of routing for a given network size than a comparable binary
tree.
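
The saving in tree levels is easy to quantify. The helper below is a generic illustration
(the function name is hypothetical): it counts how many levels a complete tree with a given
branching factor needs in order to reach a given number of leaves.

```python
def tree_levels(leaves: int, branching: int) -> int:
    """Smallest number of levels L such that branching**L >= leaves."""
    levels, reach = 0, 1
    while reach < leaves:
        reach *= branching
        levels += 1
    return levels
```

For 256 leaves, a quaternary tree needs 4 levels where a binary tree needs 8; in general
the quaternary tree has half as many levels as a binary tree over the same leaves.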



2.2.2 Separate Up and Down Trees

The logical structure of a fat-tree as described in [Leiserson 85], [Greenberg 85], and
elsewhere integrates up and down routing into a single tree. For construction purposes based
on the Transit routing components and technology, it is preferable to actually construct
this logical structure with the up and down routing trees separated. Lateral connections
are provided between the two routing trees at every level to allow connections to cross over
from the up routing tree to the down routing tree as soon as appropriate.



2.2.3 Structure 

The fat-tree structure will be constructed in three-dimensional space. A separate plane
is allocated for each tree level. One way to view this is to take the logical structure shown
in Figure 2.2 and raise each internal tree node up to a plane equivalent to its height in the
tree. Figure 2.3 shows this three-dimensional mapping of the tree structure.

Figure 2.3: Three-Dimensional View of Tree Structure

2.2.4 Upward Routing

RN1 is a crossbar switching element. Since RN1 has multiple outputs in each logical
direction, it does some amount of concentration. Its primary strength, though, is in routing.
We can take advantage of this by routing in the up tree as well as the down tree. Rather
than actually going through every logical tree level on the journey up the tree to the desired
height, some routing is done to allow the up routing path to short-cut around some tree
stages. With this short-cutting, we don't need an up routing stage for every level of the tree.
Using RN1, which distinguishes four routing directions, each up routing stage can route a
connection to one of the next three lateral crossovers in the tree or to the next up routing
stage in the tree. The next up routing stage will perform similarly. In this manner, one up
routing stage is needed for every three levels of the tree. Figure 2.4 shows a cross-sectional
view of an upward router and its logical connections to components in the up and down
routing trees. This cross-section spans three logical tree levels which are realized as one
physical up routing stage and three physical down routing stages; the cross-section shows
only a single component at each up and down routing stage.
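
The up routing decisions for a connection that must cross over at a given tree height can
be sketched as follows. This is an illustration under stated assumptions: heights are
numbered from 1, and the labels "up" and "cross1" through "cross3" are hypothetical names
for the four logical directions of an up routing stage (the text does not assign
particular direction codes).

```python
def up_stages(levels: int) -> int:
    """Up routing stages needed for a tree with the given number of levels."""
    return -(-levels // 3)          # ceiling of levels / 3


def up_path(height: int) -> list[str]:
    """Decisions made at successive up routing stages to cross over at `height`."""
    if height < 1:
        raise ValueError("height must be at least 1")
    hops, offset = divmod(height - 1, 3)
    # Continue upward `hops` times, then take one of the three lateral crossovers.
    return ["up"] * hops + [f"cross{offset + 1}"]
```

For instance, `up_path(5)` yields `["up", "cross2"]`: the first stage passes the connection
to the next up routing stage, which then selects the second of its three crossovers.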

Utilizing the routing capability of RN1 in this manner, routing up the fat-tree is
accomplished in roughly one third as many stages as there are tree levels, as opposed to the one
stage per level that would be required if no routing were performed in the up tree. At the
same time, RN1 still performs some concentration at each upward routing stage but not full
concentration. That is, at each stage in the up routing tree, all inputs destined for the same
parent node cannot end up on any wire in the up tree connected to that parent node. Each
up routing stage allows a given input wire to be connected to any of two wires in each
logical direction when wired properly; a concentration of two is effectively achieved at each
upward routing stage. The best known means of achieving full concentration at each stage
will cost O(log n) time and hardware [Ajtai 83] [Cormen 86] whereas RN1 performs its
limited concentration in constant time and hardware. In this manner, the concentration in
this fat-tree configuration is much like the fat-tree with constant size switches of Leiserson
and Greenberg [Greenberg 85] shown in Figure 1.5. The number of signals into each level up
the fat-tree does grow because fan-in occurs from more and more stages.

1 Leiserson did point out that if the routing component were modified slightly to allow a
specification of "route to height x" in the fat-tree in addition to "route in direction y" it
would be possible to specify large heights in the fat-tree with only a small number of bits
and offer more freedom in terms of concentration and bandwidth allocation [Leiserson 89].

Figure 2.4: Cross-Section View of Up and Down Routing Trees

Figure 2.5 shows what the up tree connections look like in three dimensions. The routing
components are at the base. The connections to further up routing stages are shown straight
up from each routing component. Figure 2.6 shows horizontal cross-sections of Figure 2.5
to make the convergence and routing clear. Layer 0 is the layer of routing components.
Layers 1 through 3 show where the logical connections from each routing component fan in
to the lateral connection of the next three crossover stages. The fourth logical connection
out of the routing components, of course, connects directly upward to the next up routing
stage.



Quick Routing for Very Large Systems

It is actually possible to do better than the one up routing stage per three tree levels
just described. By using a series of routing stages to switch a connection to the maximum
height up the tree to which it needs to be routed, the connection to the appropriate lateral
crossover between the up and down routing tree can be made in O(log(number of levels in
tree)) stages. Since there are O(log N) levels in the tree, this means only O(log log N)
stages are needed to perform upward routing. In essence, this routing scheme builds a
bidelta network from the processors to the various tree levels. Figure 2.7 depicts how a
bidelta network can be used to reach any height in a tree of depth 16 in only 2 stages of
routing.

Figure 2.5: Three-Dimensional View of Connections from One Up Routing Stage

However, this up routing scheme is only of interest when the fat-tree has a large number
of tree stages. In essence, we are already getting the benefit achievable from this scheme
when the size is on the order of 4 levels. The point where it actually becomes interesting
to use this scheme occurs when the number of levels is roughly 4 x 4 = 16. In our scheme,
however, 16 tree levels implies a network supporting $4^{16}$ bidelta clusters, each with 48 to
192 processors. Thus even in the smallest case we will have about 200
billion processors. The capacity analysis and geometry requirements for this case are not
considered here since this is clearly not a scheme of current practical interest.
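
The processor count quoted above can be checked with a few lines of arithmetic. This is only a sketch; it assumes one bidelta cluster at each leaf of a 16-level quaternary tree, and the variable names are illustrative.

```python
# Processor counts for a 16-level fat-tree, assuming one bidelta
# cluster at each leaf of a quaternary tree (names are illustrative).
LEVELS = 16
clusters = 4 ** LEVELS        # leaves of a 16-level quaternary tree
smallest = clusters * 48      # every cluster holds 48 processors
largest = clusters * 192      # every cluster holds 192 processors

print(f"{clusters:,} clusters")
print(f"{smallest:,} to {largest:,} processors")
```

The smallest case works out to roughly 2 x 10^11 processors, which is the "about 200 billion" figure in the text.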

2.2.5 Down Routing

The down routing path is simply the straightforward tree structure. Each down routing
stage switches in four logical directions among its four sub-trees. The logical down routing
path looks just like the logical tree structure shown in Figure 2.3. Most of the inputs to a
down routing stage come from its parent in the fat-tree. The rest of the inputs come from
the lateral crossover from the up routing tree as described in the previous section.

2.2.6 Desired Capacity Growth Rate

One of the nice properties which fat-trees can have is volume-universality. This property
essentially states that a fat-tree can efficiently simulate any other network of
comparable volume with at most a polylogarithmic slowdown [Leiserson 85]. Additionally,
volume-universality implies that larger fat-trees can be created by simply adding levels and
scaling the fat-tree structure in the three-dimensional world.

3 N.B. It will always take this many stages regardless of the height routed by such an up routing tree.

Figure 2.6: Horizontal Cross-Sections of One Up Routing Stage

To assure volume-universality, the bisection bandwidth must be properly correlated
with the volumes contained within each portion of the network. For three-dimensional
structures, bandwidth into a volume is generally proportional to the surface area of the
enclosed volume; this volume will in turn be proportional to the number of terminal nodes
or network endpoints from which the volume is composed. Thus we assume:

• Volume = $v \propto$ network endpoints.

• Bandwidth $\propto$ Surface area = $a$

• Surface area = $a \propto (\sqrt[3]{v})^2$

Using $v$ as some measure of the number of network endpoints, and $a$ as some measure of the
network bandwidth, the important relation is $a \propto (\sqrt[3]{v})^2$. If we assume that the enclosed
volume increases by a factor of 4 ($v_n = 4 v_{n-1}$) at each level, as one might expect for a
quaternary tree, it is clear that volume-universality can be achieved only if the bandwidth
increases no faster than the factor derived in Equation 2.1.

$$a_n \propto \left(\sqrt[3]{v_n}\right)^2 = \left(\sqrt[3]{4 v_{n-1}}\right)^2 = \left(\sqrt[3]{4}\,\sqrt[3]{v_{n-1}}\right)^2 = a_{n-1} \left(\sqrt[3]{4}\right)^2 = a_{n-1} \sqrt[3]{16} \qquad (2.1)$$

Figure 2.7: Hybrid Up Routing Scheme with $O(\log \log N)$ Stages in Up Routing Tree

Thus, on average, bandwidth should increase by a factor of $\sqrt[3]{16}$ at each successive level up
the tree to maximize the bandwidth available in the network while keeping the growth rate
appropriately bounded. Whether this bandwidth increase can actually be realized within
the assumed factor of four growth rate is still an open question. A generalization of the
universality proofs in [Leiserson 85] and [Greenberg 85] to this rate of growth may be
possible, but has yet to be demonstrated.
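
As a quick numerical check of this growth rate, the sketch below evaluates the per-level factor for a quaternary tree and compounds it over three levels. The variable names are illustrative only.

```python
# Per-level bandwidth growth for a quaternary tree whose enclosed
# volume grows by a factor of 4 at each level (names are illustrative).
VOLUME_GROWTH = 4
per_level = (VOLUME_GROWTH ** (1 / 3)) ** 2   # (cube root of 4)^2 = cube root of 16
three_levels = per_level ** 3                 # compounded over three tree levels

print(round(per_level, 4))      # about 2.52 per level
print(round(three_levels, 6))   # about 16 across three levels
```

A factor of 16 across every three tree levels is therefore equivalent to the average per-level growth of the cube root of 16.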

2.2.7 Channel Capacity Growth

It is easy to see the growth in average channel capacity, the number of distinct physical
connections in or out of an internal tree node, by looking at the channel capacity growth
across three tree levels. For simplicity, consider the portion of a fat-tree network shown in
Figure 2.6. The channel capacity of each logical channel in and out of the bottom of this
structure is eight. The channel capacity of the single logical channel in and out of the top
of this structure is 128. Thus, the channel capacity has grown by a factor of 16 across 3
tree levels, giving the desired average growth of $\sqrt[3]{16}$.

Since the up routing and down routing trees are separate, the channel capacity growth
in the up tree differs somewhat from the capacity of the corresponding network stages
in the down tree.

Table 2.1: Up Tree Bandwidth Allocation

Up Tree

To see how the up tree routing capacity grows, it is easiest to look at the channel
capacities for the first few stages of the network. Consider the portion of the up routing tree
shown in Figures 2.5 and 2.6. Layer 0 simply performs the routing to the next three layers,
so it does not correspond to an actual level in the tree. Each base routing component in layer
0 is an RN1 routing chip. Thus each of the 64 routing components shown in Figure 2.6(A)
has 8 inputs and outputs. The 8 outputs are divided into four logical directions with two
going to each of the next three layers and 2 going further up the tree. Thus layer 1 has 64
pairs of outputs converging to 16 different sub-trees (Figure 2.6(B)); the result is 16 logical
channels of capacity 8. Similarly, layer 2 has 64 pairs of outputs converging in 4 sub-trees,
making 4 logical channels each of capacity 32 (Figure 2.6(C)). Finally, layer 3 has 64 pairs
of outputs converging to a single sub-tree, which gives a single logical channel with capacity
128 (Figure 2.6(D)). This leaves the remaining 64 pairs of outputs, which connect to the
next routing stage as a single logical channel with capacity 128. This logical channel will
then encounter another routing stage just like layer 0 and the progression will continue in
this manner. This capacity progression is summarized in Table 2.1. The final items in the
table are the corresponding layer 0 and 1 progressions for the next routing stage. Thus the
progression of growth factors is 1, 4, 4, giving a capacity growth of 16 in 3 tree stages or
an average growth of $\sqrt[3]{16}$.
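
The layer capacities quoted above follow from dividing the 64 output pairs for each logical direction among the sub-trees they converge on. The sketch below reproduces that arithmetic; the function and constant names are hypothetical and only the numbers come from the text.

```python
COMPONENTS = 64           # RN1 chips in layer 0 of one up routing stage
WIRES_PER_DIRECTION = 2   # each chip sends 2 of its 8 outputs in each logical direction

def up_channels(subtrees):
    """Return (channel count, capacity per channel) when the 64 output
    pairs for one logical direction converge on `subtrees` sub-trees."""
    total_wires = COMPONENTS * WIRES_PER_DIRECTION
    return subtrees, total_wires // subtrees

# Layers 1 to 3 converge on 16, 4, and 1 sub-trees respectively.
layers = {1: up_channels(16), 2: up_channels(4), 3: up_channels(1)}
# The fourth direction continues upward as a single logical channel.
upward = up_channels(1)

for layer, (count, capacity) in layers.items():
    print(f"layer {layer}: {count} channels of capacity {capacity}")
print(f"upward: {upward[0]} channel of capacity {upward[1]}")
```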

Downward

The bottommost down routing level must provide 8 outputs in each logical destination
to scale properly with input to the up routing tree. This dictates 8 · 4 = 32 outputs or 4
routing chips. This implies that the total fan-in to this level is also 32. Of this fan-in,
2 · 4 = 8 inputs are consumed by lateral connections, leaving 32 - 8 = 24 inputs which must
fan-in from above. At the next level, there need to be 24 outputs for each logical destination as
just determined. This dictates 24 · 4 = 96 outputs or 12 routing components and a total
fan-in of 96. There are 2 · 4 · 4 = 32 lateral fan-in signals to this level, leaving 96 - 32 = 64
inputs for fan-in from the previous down routing stage. Similarly, each logical destination
at level three has 64 outputs associated with it. This dictates 64 · 4 = 256 total outputs
or 32 routing chips. This gives a total fan-in to this level of 256. The lateral fan-in is
2 · 4 · 4 · 4 = 128, leaving 256 - 128 = 128 inputs for fan-in from above.

    Tree Level    Capacity
    n             8
    n+1           24
    n+2           64
    n+3           128
    n+4           384

Table 2.2: Down Tree Bandwidth Allocation

Beyond this point, the structure replicates. The next level up looks like the bottom
level scaled so that the input size in each direction matches the output from the previous
stage. The progression may continue in this manner ad infinitum. The progression of channel
capacities in the downward route is thus as shown in Table 2.2. Level n+4 is equivalent
to the level n+1 stage since the series begins to repeat at that point. Thus the progression
of growth factors is 3, 8/3, 2, giving a capacity growth of 16 across 3 tree levels. This again
gives the desired average growth of $\sqrt[3]{16}$. This is identical to the growth in the upward
routing tree, but the growth does not coincide with the upward tree on a stage per stage
basis.
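
The downward capacity progression can be reproduced with a short calculation. This is a sketch under stated assumptions: the function name is hypothetical, and the factor of 16 applied to the lateral fan-in on each repetition is my reading of "the structure replicates" rather than a formula given in the text.

```python
RN1_OUTPUTS = 8   # outputs per RN1 routing chip
DIRECTIONS = 4    # logical directions per down routing stage

def down_tree(levels, bottom_capacity=8):
    """Return (capacity per direction, chips, lateral fan-in, fan-in from above)
    for each down routing level, starting at the bottommost level."""
    rows = []
    capacity = bottom_capacity
    for level in range(levels):
        block, k = divmod(level, 3)              # the pattern repeats every 3 levels
        total = capacity * DIRECTIONS            # total outputs equals total fan-in
        lateral = 16 ** block * 2 * 4 ** (k + 1) # assumed: lateral fan-in scales by 16 per repetition
        above = total - lateral                  # inputs that must come from the parent
        rows.append((capacity, total // RN1_OUTPUTS, lateral, above))
        capacity = above                         # the parent must supply this per direction
    return rows

for capacity, chips, lateral, above in down_tree(5):
    print(capacity, chips, lateral, above)
```

The first column of the result gives the capacities 8, 24, 64, 128, 384 listed in Table 2.2.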

2.3 Wiring Constraints for Efficient Bandwidth Distribution

In both the up and down routing portions of the fat-tree network, a logical routing stage
is composed of many separate routing components. The inputs to any logical routing stage
come from different logical directions; that is, the inputs to the logical routing stage may
be lateral crossover connections from different subtrees or may come from the immediate
parent routing stage. The distribution of these inputs to physical routing components
presents a more general case of the input bandwidth allocation discussed in Section 2.1.1.
In this section, the issue of bandwidth distribution to routing components is examined in
more detail in order to develop some constraints for the general problem of optimally wiring
the fat-tree network.

Again, we start by examining the two extreme cases for bandwidth distribution. I use
the word dispersion to refer to the degree to which inputs from different logical directions
are distributed across distinct routing components. Dispersion is closely related to the
notion of expansion in the theory literature.

Figure 2.8: Non-Disperse Example

2.3.1 No Dispersion

In the non-disperse case at each such logical routing stage, the inputs from a given
direction are each routed to their own set of routing components. That is, if the inputs
from a given direction make up a fraction α of the inputs into that level, then these are all
destined for an equivalent fraction of the routing components at that level. Components
will be shared between logical input directions only in the case that there is an uneven
division of components among the input directions.

At any stage, inputs from a given direction can only make use of α of the bandwidth
to each logical destination. This means the inputs from that stage can, at most, reach
α of the routing components at the next level. This effectively minimizes the number of
different components, and hence paths, reachable; this is non-optimal for fault tolerance
and for minimizing routing congestion. On the positive side, at least α of the bandwidth in
a given direction is guaranteed to the inputs from each input direction; since the inputs
from a single logical direction do not, in general, share routing components with inputs
from other directions, those inputs are guaranteed the entire α of the bandwidth.
Non-dispersion means that at each level, inputs from the same direction are competing only
with each other for the bandwidth to the next level.

Example. Consider Figure 2.8 for a concrete example of this wiring strategy. This switching
could represent Level 1 of the fat-tree in the downward routing path. The logical routing
stage is thus composed of four routing components as shown. In this case, 8 wires enter from
the lateral direction while 24 enter from the previous down routing stage in the downward
routing tree. Eight wires leave level 1 in each of the 4 logical directions.

Here, the 32 wires are partitioned based on their direction of origination. The 8 lateral
wires all go into a single routing chip. The 24 wires from farther up the tree are connected
to the remaining 3 routing chips. Of the 8 lateral connections through this level, at most
two connections can be made in each logical output direction; however, if more than two
lateral connections need to be made in a particular direction, at least 2 of the connections
will be made. A single chip failure at this level would either prevent lateral communication
entirely or cut down the bandwidth from farther up in the tree by 33%.

Figure 2.9: Disperse Example

2.3.2 Full Dispersion

Full dispersion is achieved when the input wires to each level from a given input direction
are spread out amongst as many different routing components as possible. The pair of
outputs from the same direction of a single chip at a previous level always go to different
routing components.

The number of potential paths from point to point expands at each level. This results
because each input wire from a logical input direction has the opportunity to leave on any
of two outputs; at the same time, when a connection from a logical input direction does exit
the routing level via a particular output port, it is not diminishing the bandwidth potential
for any other inputs from that same logical direction. Thus inputs from all directions
compete for the bandwidth to the next level. To a limited extent, this allows averaging of
the load from several input directions across the total available bandwidth. Similarly, the
effect of component failure is spread evenly over all logical input directions. When a chip
fails, it effectively reduces a fraction of the fan-in from each logical direction.
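
The difference in failure behavior between the two wiring extremes can be illustrated with a small model of the level 1 downward stage (4 chips, 8 lateral wires, 24 wires from above). This is a sketch only; the data layout and the function name are hypothetical, and the model counts input wires lost rather than simulating routing.

```python
def surviving_fraction(wiring, failed_chip):
    """`wiring` maps chip index -> {input direction: wire count}.
    Return the fraction of each direction's input wires that still
    reach a working chip after `failed_chip` fails."""
    totals, alive = {}, {}
    for chip, wires in wiring.items():
        for direction, count in wires.items():
            totals[direction] = totals.get(direction, 0) + count
            if chip != failed_chip:
                alive[direction] = alive.get(direction, 0) + count
    return {d: alive.get(d, 0) / totals[d] for d in totals}

# Non-disperse: all 8 lateral wires on one chip, 24 wires from above on the other three.
non_disperse = {0: {"lateral": 8}, 1: {"above": 8}, 2: {"above": 8}, 3: {"above": 8}}
# Fully disperse: every chip takes 2 lateral wires and 6 wires from above.
disperse = {chip: {"lateral": 2, "above": 6} for chip in range(4)}

print(surviving_fraction(non_disperse, 0))  # lateral input lost entirely
print(surviving_fraction(non_disperse, 1))  # one third of the input from above lost
print(surviving_fraction(disperse, 2))      # one quarter lost from every direction
```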

Example. Figure 2.9 gives the concrete example for the full dispersion case. This
corresponds to the disperse wiring of the same level 1 downward routing path given in the
example of the previous section.

In this case, the 32 wires are distributed evenly, based on direction of origination, across
the 4 routing chips. If all 8 lateral connections were destined for the same output direction,
there is a possibility that they could all be routed; there is also the possibility that they
could all be blocked because all the bandwidth in that particular direction is consumed by
connections farther up the tree. A single chip failure at this level would effectively cut the
bandwidth out of this level down by 25%.

2.3.3 Potential Path Growth

From the above, we see that full dispersion will maximize the number of paths available
for a connection through the network. This factor of two increase at each level in the
number of paths over which a given connection can be made is the best achievable using
RN1. This occurs when each wire on which a connection could potentially be made by
a common source is connected to a different routing component. There are then twice as
many potential paths toward the desired destination out of each routing level as there were
into the level. As long as this property can be maintained, each routing level will be able
to achieve this effective doubling of the number of paths between a given source and its
destination.

Certainly, it is possible to wire the network with fewer potential paths between each
source-destination pair. If two or more wires through which a given source could have routed
a message are connected to the same routing chip, then the path growth will be
smaller.

In order to guarantee that we can achieve the maximal path growth or path expansion
described above, at each level, it need only be the case that there is not a single set of
wires that can be reached from a single processor that accounts for more than 1/8 of the
total bandwidth into any level. This is not the same as saying that the inputs from a single
logical direction must compose less than 1/8 of the inputs to a given logical routing level.
The inputs from a given logical direction are not necessarily fully concentrated when they
enter a routing level. As long as the constraint given above on the fraction of potential
inputs from a single source can be maintained, each input wire from a potentially common
source can be routed through a different routing component at any level in question. For
the fat-tree routing structure described here, this property is maintained at each logical
routing level composing the network.

Leighton and Maggs [Leighton 89-2] point out that the above constraint is sufficient
only to guarantee optimal path expansion at the micro-level; that is, when looking at the
number of paths between a single source and destination. When viewing path expansion at
the macro-level, additional constraints are necessary to guarantee optimal path expansion;
macro-level path expansion is the growth in the number of paths between sets of source
and destination processors rather than simply individual processors. Leighton and Maggs
demonstrate the advantages of optimal or near-optimal macro-level expansion wiring when
used with a scheme where the switching elements can arbitrate with one another concerning
network loading [Leighton 89-1]. RN1, however, selects output paths obliviously. As such,
it is not clear that their macro-level constraints will have any significant effect in practice
on the routing performance of a network built using RN1 switches. Recent work [DeHon 90]
would indicate that this kind of expansion is useful for maximizing the fault tolerance of
the network.

The complexity of Leighton and Maggs' switching element [Leighton 89-1] is much
greater than that of RN1. As such, it would certainly represent a much slower component.
Additionally, their switching scheme requires approximately $4 \log N$ clock cycles to
route a connection as opposed to the $\log N$ clock cycles required by RN1. Both of these
properties make its routing much slower than RN1. However, if its routing efficiency turns
out to be greater than that of the oblivious routing done by RN1 by a comparable factor,
then perhaps it would be beneficial to look further into the alternative of constructing such
a switching element.

3. Construction



This chapter expands on the technical details involved in the construction of the network
structure described in the previous chapter. Section 3.1 discusses a few further issues about
the construction of the bidelta clusters at the leaves of the fat-tree. Section 3.2 addresses
a couple of issues important to the realization of such a network using RN1. Section 3.3
introduces the unit tree, a stack structure that can serve as the basic building block for this
kind of fat-tree network. Then Section 3.4 describes a possible geometry for arranging these
unit tree structures in three dimensions to realize networks of reasonable size. Section 3.5
follows, describing construction issues for this geometry. Section 3.6 then deals with the
long wires that necessarily occur in this structure. Section 3.7 briefly details processor
placement with respect to such a network. Finally, Section 3.8 details an efficient scheme
for computing routing sequences for this network.

3.1 Bidelta Leaf Clusters

Section 2.1 described the composition of the bidelta leaf clusters. This section fills in
construction details not readily apparent from the previous description.

3.1.1 Connections To Fat-Tree

For simplicity, the 11 logical routing direction out of the first stage of routing in the
bidelta cluster is chosen to route into the fat-tree network. That is, if the first two routing
bits of the routing sequence are both ones, the connection will be routed out of the bidelta
cluster and into the fat-tree network. The choice of which logical direction to use for this
purpose is somewhat arbitrary, but it is useful to settle on a single direction for reference
purposes.

3.1.2 Connections From Fat-Tree

The first stage routing components will be configured to swallow the first routing byte
they encounter, as described in Section 1.3.2, for cluster inputs from the fat-tree network.
This means the swallow property must be set only on the two inputs from the fat-tree
network to each routing component in the first routing stage; the other six inputs to each
routing component will come directly from processors and do not require this swallow property
to be set. In this manner, a fresh routing byte is always used to route through the bidelta
cluster regardless of the source of the connection. This provides a logical separation between
the portion of the routing sequence used to route through the fat-tree and that used to route
within the cluster, making routing calculation moderately easy. It also guarantees that a
fresh set of routing bits is available to route through the bidelta cluster.

3.1.3 Size

The number of processors that a single bidelta cluster stack can support is constrained
by physical size, bandwidth, and granularity. Since the routing component distinguishes
four distinct directions, each additional routing stage will increase the size, the number of
processors, by a factor of four. Note, however, that since one logical direction out of the
first stage connects to the fat-tree rather than the bidelta network, the first stage only
distinguishes three directions within the network. This granularity constraint alone specifies
the construction of clusters which support $3 \cdot 4^i$ processors, for any non-negative $i$. When
considering physical size, we are limited by the size at which we can construct the horizontal
routing boards (see Section 1.4). Considering the projected size of RN1, the largest possible single
horizontal routing plane we can construct will support 8 x 8 = 64 routing components; this
would allow the construction of the 4 stage network supporting $3 \cdot 4^3 = 192$ processors.
Additionally, the bidelta cluster will have to provide bandwidth to and from the fat-tree
network which will match that of the fat-tree network being constructed.

For notational convenience, a bidelta cluster supporting $n$ processors will be denoted
by $B_n$; e.g., a 3 stage bidelta cluster will support 48 processors and will thus be
referenced as $B_{48}$.
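
The granularity constraint can be written as a one-line function. This is a minimal sketch; the function name is hypothetical, and it assumes a cluster of s routing stages supports 3 · 4^(s-1) processors, which matches the 48 and 192 processor figures above.

```python
def cluster_size(stages):
    """Processors supported by a bidelta leaf cluster with `stages` routing
    stages: the first stage offers 3 in-cluster directions (the fourth leads
    to the fat-tree) and every later stage offers 4."""
    if stages < 1:
        raise ValueError("a cluster needs at least one routing stage")
    return 3 * 4 ** (stages - 1)

for stages in (1, 2, 3, 4):
    print(stages, cluster_size(stages))
```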

3.2 Inter-Stage Details

3.2.1 Up Path Directions

Routing components in an up routing stage route in four distinct directions as shown
in Figure 2.4. Here, again, the assignment of actual routing directions to each of the four
logical directions is somewhat arbitrary. It is useful, though, to make this assignment in a
logical manner for the sake of reference and routing simplicity. The 11 direction is assigned
to the direction which routes further up the tree. This is consistent with the choice of the 11
direction to route out of the bidelta cluster. The 00, 01, and 10 directions route, respectively,
to each of the next three successive down routing stages. That is, the 00 direction
routes to the next down routing stage, 01 to its parent, and 10 to the next parent.
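
Under this assignment, the bit pairs consumed in the up tree follow directly from the height of the lateral crossover. The sketch below is illustrative only: the level numbering (level 1 is the lowest down routing stage) and the function name are assumptions, not conventions from the text.

```python
LATERAL_BITS = ["00", "01", "10"]   # crossover to the next three down routing stages
UP_BITS = "11"                      # continue to the next up routing stage

def up_route(crossover_level):
    """Bit pairs used in the up tree to cross over at `crossover_level`,
    where level 1 is the lowest down routing stage (assumed numbering)."""
    if crossover_level < 1:
        raise ValueError("crossover_level must be at least 1")
    # Each up routing stage reaches three down routing levels laterally.
    stages_passed, offset = divmod(crossover_level - 1, 3)
    return [UP_BITS] * stages_passed + [LATERAL_BITS[offset]]

print(up_route(1))
print(up_route(4))
print(up_route(8))
```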

3.2.2 Swallows

In order to allow routing to arbitrarily many destinations, we need to make sure that
the fat-tree network is configured to discard an old routing byte when it is exhausted. Thus
the swallow stages need to be arranged in a functional manner. This task is complicated
by the fact that connections can be made through arbitrary heights in the tree such that
complete paths from source to destination are not entirely homogeneous; that is, paths will
differ in length based on where they entered the down routing tree.

Ideally, we would like each of the swallow stages to be placed a maximum distance from
each other. This is desirable because each swallow stage effectively costs one clock cycle of
delay while the old routing byte is being discarded so the new one can take its place. Thus, to
minimize the time to make a connection, swallow stages should be placed as infrequently
as possible.

Since there are 8 bits in each routing byte and each routing stage consumes 2 bits,
swallows are required at least every four routing stages. Swallow stages will certainly be
needed at least every 4 stages on the up path and on the down path. It is possible to change
from up routing to down routing at any up stage. In order to be assured that the down
routing stages get fresh routing bytes when needed, the old routing byte will be stripped
off as part of the lateral crossover from the upward tree to the downward. This routing
byte change will be realized by setting the swallow property on all lateral inputs to a down
routing stage. This lateral swallow provides a clear separation between the up and down
routing portions of a connection.
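
The byte budget implied by these rules can be sketched as follows. The function name is hypothetical, and the count assumes that the up portion and the down portion of a connection each start on a fresh routing byte, as the lateral swallow suggests; routing bytes used inside the bidelta clusters are not counted.

```python
import math

BITS_PER_BYTE = 8    # routing bits carried by one routing byte
BITS_PER_STAGE = 2   # routing bits consumed by each routing stage
STAGES_PER_BYTE = BITS_PER_BYTE // BITS_PER_STAGE

def routing_bytes(up_stages, down_stages):
    """Routing bytes needed inside the fat-tree for a connection that uses
    `up_stages` up routing stages and `down_stages` down routing stages,
    assuming the lateral swallow starts a fresh byte for the down path."""
    up_bytes = math.ceil(up_stages / STAGES_PER_BYTE)
    down_bytes = math.ceil(down_stages / STAGES_PER_BYTE)
    return up_bytes + down_bytes

print(STAGES_PER_BYTE)        # stages served by one routing byte
print(routing_bytes(1, 3))
print(routing_bytes(5, 6))
```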

3.2.3 Bit Rotations

Recall from Section 1.3 that the appropriate bits from the data path are used to perform
routing. In order to assure that a different pair of bits is used at each level, the data path
is rotated between routing levels. This poses a similar problem to the swallow. Whereas,
in a bidelta network, all paths in the network are of the same length, all paths through the
fat-tree are not. More importantly, all paths are not even of the same length modulo four.
Even paths through a given routing component can differ in length. Thus, some attention
must be given to assuring that the bits are rotated uniformly through the tree network.

From the source processor to a given height in the tree, the amount of rotation incurred
will be known. Similarly, from that height down to a destination, the number of rotations
will be known. What we must assure is that the total number of bit rotations modulo
four through the network is the same regardless of the height at which the lateral crossover
occurs. In order to guarantee this, we must use the lateral crossover between the up routing
tree and the down routing tree to shuffle the bits into a consistent state; that is, make sure
that the lateral bits into a given down routing level have the same rotation applied to them
as those coming into the down routing level from above. This is closely related to the need to
swallow on the lateral crossover in order to be synchronized with the point of entry into
the downward routing path and similarly guarantees that, regardless of the history of a
connection's path, it merges consistently into the downward tree.

3.3 Unit Tree

In building the fat-tree network, we are constrained in the size at which we can reliably
fabricate the component stacks. Recall from Section 1.4 that the horizontal pc-boards are limited
to about two feet square. There are also other difficulties that arise if we attempt to build
larger stacks. The on-board wire lengths begin to become a more serious concern, slowing
the clock cycle of each routing stage. The already dense wiring on the pc-boards becomes
proportionally more severe.

It is clear that the fat-tree network will have to be packaged in multiple component
stacks. From this realization, it is necessary to determine how to separate the fat-tree into
stacks and how each constituent stack is composed. We wish to avoid building many stacks
of different geometries, and hence needing to design and fabricate many different pc-boards
and other components. We want as much as possible to be reusable when scaling to larger
and larger sized fat-trees. Essentially, we need a standard replicable structure that can
serve as a building block for fat-trees in much the same way that a set of identical routing
components can be used to construct a routing stack.

3.3.1 Unit Tree Structure

The unit tree is a single stack structure from which a fat-tree interconnection of a wide
variety of sizes can be built. This basic building block, when replicated and properly
arranged, will realize the fat-tree structure described herein.

Size

As noted, current technology limits the size of a single horizontal pc-board routing layer
to about 2' x 2'. Given the target size of RN1 as 1.4" x 1.4", if we leave an equal amount
of space between each routing component for routing wires, we can place about 8 x 8 = 64
routing components in a single routing layer. As such, a unit tree is constrained to this size
for the present time. As technology improves, the number of components per layer will, no
doubt, increase; however, this scheme is very scalable and can easily be modified to take
advantage of improved technology. For the remainder of this paper, 64 routing components
will be assumed as the maximum layer size.

The basi c structure of the fat -tree describedinSection2.2 showed that the fat -tree is 
bui It byreplicatingaseries of one upward routi ng stage and 3 downward routi ng stages . 
These 4 stages describe the natural structure of the tree. This gives a natural division 
between routi ng 1 evel s appropri ate for i ncl usi on i n each uni t tree stack. 

The unit tree is thus constructed in 4 layers as shown in Figures 2.5, 2.6, and 2.3. The up
routing layer is at the base and is composed of the maximum 64 routing components. It is
followed by the first down routing stage, which is also composed of 64 routing components in
order to match bandwidth with the up routing stage. The next down routing stage provides
input to only three-fourths of the first down routing stage since lateral inputs consume one-
quarter of the input capacity to the first down routing stage. It is thus composed of
only 48 routing components. The third, and final, down routing stage provides inputs to
the second down routing stage. Since the second down routing stage has one-third of its
inputs dedicated to lateral crossover inputs, the third stage is only two-thirds the size of
the second. This makes it one-half the size of the first down routing stage, which is 32
components. The component count for a unit tree is shown in Table 3.1. Each
such unit tree thus requires a total of 208 routing components. With four routing layers,
the stack should be about 1.5" tall, making the entire stack roughly 2' x 2' x 1.5". Since
there are several variations of this basic unit tree worth considering, when it is necessary
to differentiate them, this particular unit tree will be referenced as UT64x8.
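The layer sizes above follow mechanically from the lateral-input fractions, which the short sketch below reproduces. The function name `unit_tree_layers` and its parameter are illustrative choices, not names from the paper.

```python
from fractions import Fraction

def unit_tree_layers(max_layer=64):
    """Return component counts [up, down 1, down 2, down 3] for one unit tree."""
    up = Fraction(max_layer)
    down1 = up                       # matches the bandwidth of the up stage
    down2 = down1 * Fraction(3, 4)   # one-quarter of the down-1 inputs are lateral
    down3 = down2 * Fraction(2, 3)   # one-third of the down-2 inputs are lateral
    return [int(n) for n in (up, down1, down2, down3)]

layers = unit_tree_layers()
print(layers, sum(layers))  # [64, 64, 48, 32] 208
```

Calling `unit_tree_layers(16)` gives the quarter-scale counts (16, 16, 12, 8) under the same fractions.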

Characteristics

With this development, the size and geometry of a basic unit tree is fully constrained.
As such we can summarize the bandwidth characteristics as given in Table 3.2.

Level      Routing Components
Up         64
Down 1     64
Down 2     48
Down 3     32

Table 3.1: Unit Tree Component Summary





                       Total        Total Logical
                       Bandwidth    Channels         Capacity
Up from Leaves         512          64               8
Down toward Leaves     512          64               8
Up toward Root         128          1                128
Down from Root         128          1                128

Table 3.2: UT64x8 Bandwidth



Stack      Component    Available    Required Bandwidth    Required Bandwidth
Layer      Count        Bandwidth    Down Path             Up Path
Up         64           512          512                   -
Down 1     64           512          -                     384
Down 2     48           384          -                     256
Down 3     32           256          -                     128

Table 3.3: UT64x8 Vertical Through Bandwidth
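The entries of Table 3.3 can be cross-checked with a few lines of arithmetic. The sketch below assumes 8 through conduits per routing component package and assumes that 128 channels leave the up path laterally at each down routing layer; the second figure is inferred from the table rather than stated directly, and all names are illustrative.

```python
CHANNELS_PER_COMPONENT = 8   # assumed: 8 through conduits (and 8 outputs) per package

def through_bandwidth(components=(64, 64, 48, 32), lateral=128):
    """Return (layer, available, required) through bandwidth for each stack layer."""
    up, *downs = components
    up_outputs = up * CHANNELS_PER_COMPONENT
    # The up layer must pass the entire downward bandwidth toward the leaves.
    rows = [("up", up * CHANNELS_PER_COMPONENT, up_outputs)]
    remaining = up_outputs
    for level, count in enumerate(downs, start=1):
        remaining -= lateral   # channels that cross over into this down layer
        rows.append((f"down {level}", count * CHANNELS_PER_COMPONENT, remaining))
    return rows

for layer, available, required in through_bandwidth():
    print(layer, available, required, available >= required)
```

Under these assumptions every layer has at least as much available through bandwidth as it requires.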



Through Bandwidth

All of the vertical straight through connections between horizontal boards in the stack
structure are contained on the routing components' package as described in Section 1.3.1.
This limits the number of vertical routing conduits available for up or down routing con-
nections which simply pass through a layer without being involved in the routing occurring
on that layer. Each package provides sufficient through routing conduits for 8 bundles of
9 wires apiece. No vertical vias are needed for the downward routing path, except when
the entire downward bandwidth must route through the upward routing level. The upward
routing path requires through bandwidth on all the down routing levels. Available and
required bandwidth is summarized in Table 3.3. Clearly, there is adequate vertical
through bandwidth available for this scheme.

3.3.2 Tree Root

Two alternatives exist for terminating the fat-tree. It is possible simply to build the
tree up to a desired size and leave the bandwidth out of the top of the topmost stage of unit
trees unused. Alternately, we can build a capping level which utilizes the logical direction
which would have routed further up the tree to route laterally to another down routing
level.

The first option is the conceptually cleanest. All of the unit tree stacks will be identical.
Routing will be identical on trees of all sizes.

The latter option allows more size flexibility in some cases. Knowing the size of the stack
becomes important to routing considerations when the root stack is capped in this manner.
Stacks with a cap level will be slightly different from other stacks, so not all stacks will
be identical; however, these capped stacks differ only by the addition of one extra
routing layer on top of a standard unit tree stack.

Cap Level 

Conceptually, the root layer constructed for a capped unit tree will be for the conver-
gence of four unit trees; however, the components that compose this layer can be distributed
across the four stacks. This distribution is necessary since we cannot make a single stack
larger. At the top of a standard unit tree there are 128 channels that would go further
toward the root and 128 coming from the root. With a down bandwidth of 128 into each
of the 4 stacks, and hence logical directions, we need a total of 64 routing components to
compose the final level. One-fourth of these components can be placed in each stack. This
appropriately gives 128 outputs from this new level to the 128 inputs to the down routing
level which was formerly the top of each stack. Each stack will have 32 con-
nections through the capped root back to itself and 32 connections from each of the three
adjacent stacks into its original top down routing level. Similarly, each stack will connect
to 3 other stacks with 32 connections. Obviously, if the root level of unit trees is composed
of more than 4 stacks, the 32 connections in each logical direction should be maximally
distributed over all stacks composing each logical direction.

For the sake of clarity, this capped unit tree will be referenced as UT64x8c.
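The cap-level numbers can be reproduced with a small calculation. This is only a sketch: `cap_level` and its parameters are hypothetical names, and the figure of 8 outputs per routing component is an assumption chosen because it is consistent with 64 components serving 4 x 128 channels.

```python
def cap_level(stacks=4, top_bandwidth=128, outputs_per_component=8):
    """Return (total cap components, cap components per stack, connections per stack pair)."""
    total_components = stacks * top_bandwidth // outputs_per_component
    per_stack = total_components // stacks
    # Each stack's 128 down inputs are shared evenly among the converging stacks.
    per_pair = top_bandwidth // stacks
    return total_components, per_stack, per_pair

print(cap_level())  # (64, 16, 32)
```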

3.3.3 Building a Fat-Tree From Unit Trees

Each bidelta leaf cluster has a bandwidth of $\frac{2 N_{leaf}}{3}$ into the fat-tree, where $N_{leaf}$ is the
number of processors in the leaf cluster. A single unit tree has a bandwidth of 8 in each of
the 64 logical directions it distinguishes. Thus $\frac{N_{leaf}}{12}$ unit trees are required to construct
the smallest fat-tree, which has only a single layer of unit trees. The next larger size can
then be constructed by replicating this smallest structure 64 times and using another layer
of unit trees to interconnect these 64 subtrees. Enough unit trees at this second level will
be needed to match the bandwidth out of the top of the first stage of unit trees. This
progression can be continued to build fat-trees arbitrarily large.

Examples

12K: As an initial example, consider building a 12K processor machine. We can use a
192 processor bidelta leaf cluster for the leaf. The next level is then built from $\frac{192}{12} = 16$ UT64x8 unit trees. This
gives $192 \cdot 64 = 12288$ processors using 64 bidelta clusters and 16 UT64x8 unit tree
blocks.

48K: Using a capping level at the top of a single layer of unit trees, we can support 48K
processors, four times as many processors as the previous example. The number of unit
trees simply needs to grow by a factor of four. Similarly, four times as many bidelta clusters
are needed. Thus, this is constructed from 256 bidelta leaf clusters and 64 UT64x8c unit trees.

768K: Adding a full second level of unit trees allows us to connect $192 \cdot 64^2 = 786432$ (768K)
processors. This requires a bidelta leaf cluster for each 192 processors:

$$\frac{768K}{192} = 4096$$

The lowest level of unit trees is composed of unit trees given by the following:

$$\underbrace{\frac{768K}{192}}_{a} \cdot \overbrace{\underbrace{\frac{192}{12}}_{b} \cdot \underbrace{\frac{1}{64}}_{c}}^{d} = 1024$$

(a) is the number of bidelta clusters that connect to this level. (b) is the number of unit
tree stacks necessary to satisfy the bandwidth for a 192 processor bidelta block. (c) is the
fraction of the unit trees from (b) that are actually used by a single bidelta block. (d), which
is the product of (c) and (b), is the number of unit tree stacks required per bidelta cluster
to build this first level. The bandwidth out of the top of a unit tree is one-fourth
the bandwidth into it. Thus the next level will only require one-fourth as many unit trees:

$$\frac{1024}{4} = 256$$

This brings the total structure to 4096 bidelta clusters and 1280 UT64x8 unit tree
stacks.

Generalizing 

We can generalize network sizes and composition in terms of the number of unit tree
layers used for network construction. Again, $N_{leaf}$ is the number of processors composing
a single bidelta leaf cluster. $l$ is a parameter denoting the total number of stack levels,
including the leaf cluster stack; $l$ is constrained only to be an integer greater
than one. The total number of processors supported by a network with $l-1$ unit tree layers,
all of type UT64x8, is thus $N_{total}$ as shown in Equation 3.1.

$$N_{total} = 64^{(l-1)} \cdot N_{leaf} \qquad (3.1)$$

This requires one bidelta cluster for each group of $N_{leaf}$ processors.

$$N_{bidelta} = \frac{N_{total}}{N_{leaf}} \qquad (3.2)$$

The first layer of unit trees is composed of $\frac{N_{leaf}}{12} \cdot \frac{1}{64}$ UT64x8 unit trees for each bidelta
cluster. Each successive layer requires one-fourth as many unit trees since the bandwidth
out of the top of each unit tree is one-fourth the bandwidth into the bottom. Thus the
total number of UT64x8 unit trees necessary to construct a network of size $N_{total}$ is simply
as given by Equation 3.3.

$$N_{unit} = \sum_{j=2}^{l} \left( \frac{N_{total}}{N_{leaf}} \cdot \frac{N_{leaf}}{12} \cdot \frac{1}{64} \cdot \frac{1}{4^{(j-2)}} \right) = \frac{N_{total}}{768} \sum_{j=2}^{l} \frac{1}{4^{(j-2)}} = \frac{N_{total}}{576} \left( 1 - 4^{(1-l)} \right) \qquad (3.3)$$

Thus a network of $N_{total}$ processors is constructed with $N_{bidelta}$ leaf clusters and $N_{unit}$
UT64x8 unit trees.
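As a sanity check on the closed forms, the following sketch evaluates Equations 3.1 through 3.3 and compares them with the 12K and 768K examples. The function name `uncapped_network` and the use of exact fractions are illustrative choices, not part of the original design description.

```python
from fractions import Fraction

def uncapped_network(n_leaf, levels):
    """Return (N_total, N_bidelta, N_unit) for `levels` stack levels, leaf level included."""
    n_total = 64 ** (levels - 1) * n_leaf                            # Equation 3.1
    n_bidelta = n_total // n_leaf                                    # Equation 3.2
    n_unit = Fraction(n_total, 576) * (1 - Fraction(4) ** (1 - levels))  # Equation 3.3
    return n_total, n_bidelta, n_unit

print(uncapped_network(192, 2))
print(uncapped_network(192, 3))
```

With 192 processor leaf clusters, two levels yield 12288 processors, 64 clusters, and 16 unit trees; three levels yield 786432 processors, 4096 clusters, and 1280 unit trees.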

Using capped unit trees at the top level, the composition is slightly different. The total
number of processors is greater by a factor of four because of the extra routing stage.

$$N_{total} = 4 \cdot 64^{(l-1)} \cdot N_{leaf} \qquad (3.4)$$

The number of bidelta leaf clusters needed is computed as before.

$$N_{bidelta} = \frac{N_{total}}{N_{leaf}} \qquad (3.5)$$

The number of stacks progresses the same. The difference here is that the top level is
constructed out of UT64x8c unit trees. Thus the number of UT64x8 unit trees, $N_{unit}$, and
UT64x8c unit trees, $N_{cunit}$, are computed by Equations 3.6 and 3.7.

$$N_{unit} = \sum_{j=2}^{l-1} \left( \frac{N_{total}}{N_{leaf}} \cdot \frac{N_{leaf}}{12} \cdot \frac{1}{64} \cdot \frac{1}{4^{(j-2)}} \right) = \frac{N_{total}}{576} \left( 1 - 4^{(2-l)} \right) \qquad (3.6)$$

$$N_{cunit} = \frac{N_{total}}{768 \cdot 4^{(l-2)}} \qquad (3.7)$$
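The capped case can be checked the same way. The sketch below evaluates Equations 3.4 through 3.7, reading the garbled exponents as $(2-l)$ in Equation 3.6 and $(l-2)$ in Equation 3.7, which is the reading that reproduces the 48K example; `capped_network` is a hypothetical helper name.

```python
from fractions import Fraction

def capped_network(n_leaf, levels):
    """Return (N_total, N_bidelta, N_unit, N_cunit) for a tree whose top layer is capped."""
    n_total = 4 * 64 ** (levels - 1) * n_leaf                            # Equation 3.4
    n_bidelta = n_total // n_leaf                                        # Equation 3.5
    n_unit = Fraction(n_total, 576) * (1 - Fraction(4) ** (2 - levels))  # Equation 3.6
    n_cunit = Fraction(n_total, 768 * 4 ** (levels - 2))                 # Equation 3.7
    return n_total, n_bidelta, n_unit, n_cunit

print(capped_network(192, 2))
```

For `levels=2` this gives 49152 processors, 256 leaf clusters, no uncapped unit trees, and 64 capped unit trees, matching the 48K example.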



3.3.4 Alternative Unit Tree

It is also worthwhile to consider an alternative unit tree structure. In particular, this is
necessary if we wish to construct entirely fat-tree networks; that is, if we wish to construct
networks that have processors as leaves rather than bidelta clusters. Alone, the unit tree just
described is inadequate for this purpose because it will always provide at least 8 connections
in each logical direction. To match the connections to a single processor, we need a unit
tree that provides two logical connections per direction. This unit tree can be used by itself
to construct such a complete fat-tree network, or used only as the bottommost tree stage
of a fat-tree network utilizing UT64x8 unit trees for the construction of the upper portion
of the tree. For distinction, this unit tree will be referred to as UT64x2.

Fault Tolerance

To get the desired two outputs, and hence inputs, in each logical direction, we essentially
need to scale down the size of the unit tree constructed by a factor of four from that of
the UT64x8 unit tree. In naively scaling down the unit tree size, we encounter one problem
at the final down routing stage. Each single routing component becomes critical in order
to provide network connectivity to four processors; that is, unlike the components in other
stages, if one of these components fails, four processors will become unreachable.

This identical sort of problem occurs in the bidelta clusters. As described in Section 1.5,
this was the reason for implementing the alternative configuration for RN1 as two
separate 4 x 4 crossbars. With RN1 configured in this manner at the bottommost down
routing stage, instead of having a single component that makes the final routing connection
to 4 processors, we have 2 components that make the final connection to 8 processors. This
way no single component is critical to the functionality of the network.

To avoid having a similar problem on the inputs, we also configure these as in the bidelta
configuration. While each processor will use only a single connection into the network at
a time to guarantee that the network is not overloaded, each processor has two network
connections. Each of these network connections should be connected to a different routing
component. The processor then guarantees to use only one of the two network connections
at a time. In this manner, no single component in the first up routing stage of the network
is critical, since a given processor can always originate a connection through the network
through either of the two routing components to which it is connected.

Size 

The entire UT64x2 ends up being one-quarter the size of the UT64x8 unit tree. A UT64x2
is thus composed of a total of 52 RN1 routing components. The stages are each composed
of one-fourth as many routing components as the corresponding stages of the UT64x8 unit
tree, as summarized in Table 3.4. The largest routing layer has 16 components. These can be
arranged as 4 x 4 components in the horizontal plane. This makes each side of the UT64x2
roughly half the size of each side of the UT64x8 unit tree. Thus each side of the UT64x2 unit
tree will be about one foot long. The number of stages is the same between the UT64x2
and UT64x8 unit trees so they will be the same height.

Characteristics

This composition gives the UT64x2 unit tree the characteristics shown in Table 3.5.

Level      Routing Components
Up         16
Down 1     16
Down 2     12
Down 3     8

Table 3.4: UT64x2 Component Summary

                       Total        Total Logical
                       Bandwidth    Channels         Capacity
Up from Leaves         128          64               2
Down toward Leaves     128          64               2
Up toward Root         32           1                32
Down from Root         32           1                32

Table 3.5: UT64x2 Bandwidth



Through Bandwidth

As with the UT64x8 unit tree, all vertical routing vias between routing levels in the stack
structure are contained in the routing components. Since this is only a scaled down version
of the UT64x8 unit tree, we have no new problems with vertical through routing bandwidth.



Building Trees

Building fat-trees with UT64x2 unit trees is done in much the same manner as before.
This unit tree can be used as the only kind of unit tree in constructing the fat-tree. Alter-
nately, it can be used as the leaf node in place of the bidelta clusters, and the UT64x8 unit
tree can be used to build the rest of the structure for the tree. Using UT64x8 unit trees
for the upper tree structure gives better routing performance and is hence preferable. This
difference in routing performance arises from the usage of the alternate RN1 configuration
in the final down routing stage of the UT64x2 unit tree. This makes the UT64x2's routing
performance slightly less desirable than that of the UT64x8 unit tree, where the last stage
is configured normally.

Since there is a performance differential, it is reasonable to consider using the UT64x2
unit trees only at the bottommost layer of the fat-tree. The fat-trees so constructed are free to
be any power of 64.

$$N_{total} = 64^{l} \qquad (3.8)$$

One UT64x2 unit tree is needed for each 64 processors. Thus the total number needed,
$N_{funit}$, is given by Equation 3.9.

$$N_{funit} = \frac{N_{total}}{64} \qquad (3.9)$$

The number of UT64x8 unit trees in the next level is determined by matching the total
bandwidth out of the tops of the UT64x2 unit tree layer with the bandwidth into each
UT64x8 unit tree. As before, successive unit tree levels each require one-quarter the number
of unit trees as the preceding level. As such, Equation 3.10 gives the number of UT64x8
unit trees needed, $N_{unit}$.

$$N_{unit} = \sum_{j=2}^{l} \left( \frac{N_{total}}{64} \cdot \frac{32}{512} \cdot \frac{1}{4^{(j-2)}} \right) = \sum_{j=2}^{l} \frac{N_{total}}{256 \cdot 4^{(j-1)}} = \left( 1 - 4^{(1-l)} \right) \frac{N_{total}}{768} \qquad (3.10)$$

Example: Using two levels of UT64x8 style unit trees on top of one level of UT64x2 style
unit trees, the fat-tree network will support $64^3$ = 256K processors. This requires unit
trees as follows:

$$N_{funit} = \frac{64^3}{64} = 64^2 = 4096$$

$$N_{unit} = \left( 1 - 4^{(1-3)} \right) \frac{64^3}{768} = 320$$
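The example can be verified numerically. The sketch below assumes the reading $N_{total} = 64^l$ for Equation 3.8, with $l$ counting the UT64x2 layer plus the UT64x8 layers above it; `mixed_network` is an illustrative name only.

```python
from fractions import Fraction

def mixed_network(levels):
    """Return (N_total, N_funit, N_unit) for one UT64x2 layer under levels-1 UT64x8 layers."""
    n_total = 64 ** levels                                               # Equation 3.8
    n_funit = n_total // 64                                              # Equation 3.9
    n_unit = Fraction(n_total, 768) * (1 - Fraction(4) ** (1 - levels))  # Equation 3.10
    return n_total, n_funit, n_unit

print(mixed_network(3))
```

For `levels=3` this reproduces 262144 processors, 4096 UT64x2 unit trees, and 320 UT64x8 unit trees; for `levels=2` it gives the 64-to-4 convergence of UT64x2 stacks onto UT64x8 stacks.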



3.3.5 Wiring Details for Unit Trees

As described in Sections 1.3.2 and 3.2.2, in order to specify more than 8 bits of routing
information, the first routing byte must be periodically stripped from the data stream.
Within the unit tree structure, this operation should logically be performed at three places.

1. every fourth unit tree up on the up routing path

2. at the lateral crossovers between the up and down routing trees

3. upon entrance of a unit tree at the top down routing level when the connection crosses
over higher up in the fat-tree.

The first case will only be necessary when more than 8 bits are needed to specify up routing.
This becomes necessary only when we have more than $N_{leaf} \cdot 64^4$ processors, so it will not be
likely to be necessary in the near future. The need for lateral crossovers is described in
Section 3.2.2. Recall that the wiring constraints of Section 2.3 recommend that inputs from
different logical directions be distributed across as many different routing components as
possible. As such, lateral crossover inputs will only make up a fraction of the inputs to a
given routing component and hence be configured differently than the remaining inputs;
fortunately, RN1 allows the swallow property to be configured independently for each
input. The logical place for stripping bytes in the downward path is at the entrance of
each unit tree from higher in the fat-tree. This provides the most regular place for this
function. This is somewhat non-optimal in that the swallow occurs between every 3 stages
of down routing, as opposed to every 4; this means that the routing byte is being changed
slightly more frequently than it could be in the best case. As such, the low two bits of each
down routing byte are unused.

The rotation of bits between unit tree stages must be carefully arranged so that all
paths through the fat-tree network rotate the routing bits equivalently, as described in
Section 3.2.3. Obviously, each byte-wide routing path must be rotated by 2 bits following
each up routing stage and following each down routing stage. Since there are only three
down routing stages through each unit tree, following the last down routing stage there
must actually be a four bit rotation so that the bits are correctly aligned to enter the
following unit tree. The lateral crossover connections pose the least straightforward bit
rotations. Each lateral crossover must guarantee that as the connection enters the down
routing path, the bit rotation is consistent with the point at which the down routing path is
entered. While this is easy enough to guarantee, this implies that the amount of rotation in
each crossover will differ depending on the height of the unit tree in the fat-tree; this results
from the difference in the length of the up routing path traversed before encountering the
crossover. This requirement, unfortunately, forces unit trees to differ slightly depending
upon which layer of the fat-tree they are implementing. The effect of this difference can be
minimized by locating the necessary rotation difference entirely in the horizontal pc-routing
board just above the up routing stage. This localizes the difference in unit trees to a single
horizontal routing board. Of course, since four stages of two bit rotations return the
rotation to the original rotation, at most four such distinct boards will be necessary to
build an arbitrarily large fat-tree.
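The rotation bookkeeping can be illustrated with a small model of a byte-wide routing path. This is a sketch only: the direction of rotation (left) and all function names are assumptions made for illustration, and only the rotation amounts come from the text.

```python
def rotate_left(byte, bits):
    """Rotate an 8-bit value left by `bits` positions."""
    bits %= 8
    return ((byte << bits) | (byte >> (8 - bits))) & 0xFF

def down_path_through_unit_tree(byte):
    """Apply the rotations of three down stages: 2 bits, 2 bits, then 4 bits."""
    for bits in (2, 2, 4):
        byte = rotate_left(byte, bits)
    return byte

def crossover_rotation(height):
    """Rotation accumulated after `height` up routing stages, modulo one byte."""
    return (2 * height) % 8

# A full pass down one unit tree leaves the routing byte aligned as it entered.
print(down_path_through_unit_tree(0xA7) == 0xA7)
# Only four distinct crossover rotations ever occur, whatever the tree height.
print(sorted({crossover_rotation(h) for h in range(1, 33)}))
```

The second result is the reason at most four distinct crossover boards are needed.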



3.3.6 Wire Accounting

There are a large number of wires entering and leaving each unit tree stack in order
to properly connect the unit tree in the network. These wires, by necessity, must go to
a number of different locations. While each data path of 9 wires could go to a different
destination, actually allowing them to do so would unnecessarily complicate the wiring
pattern. To prevent this, we would like to group together reasonably sized wire bundles to
interconnect unit trees and connect unit trees to leaf nodes.

UT64x8

The bandwidth entering and leaving the bottom of the UT64x8 unit tree stack is logically
segregated into 64 directions, each of which is 8 channels wide. Each channel is composed
of 9 bits. There are equally many channels going both into the stack from below and
downward out of the stack. Grouping this together, we have 2 x 8 x 9 = 144 wires in each
logical direction. It makes sense to use this as the standard wire bundle size.

The bandwidth leaving the top of one of these unit trees composes a single logical
direction. However, since it will in turn be connecting to other UT64x8 unit trees, it makes
most sense to use the same wire bundle size. Since there is only one-fourth the bandwidth
exiting the top of the stack, there will only be 16 such bundles out of the top of a unit tree.

Unit Tree Type        Number of Bundles    Bundle Size    Total Wires
UT64x8   Bottom       64                   144            9216
         Top          16                   144            2304
UT64x8c  Bottom       64                   144            9216
         Top          12                   144            1728
UT64x2   Bottom       64                   36             2304
         Top          4                    144            576

Table 3.6: Unit Tree Wire Bundling for External Connections



UT64x8c

Since the capped unit tree, UT64x8c, is simply a variation on the UT64x8, it is very
similar. The bandwidth into the bottom is identical to that of the UT64x8. The wires out
of the top are different. In this case, each unit tree will potentially be connecting to 3 or
more others. For consistency, bundles of 144 wires can be used here as well. There are
three logical directions out of the top of the UT64x8c. Each direction has one-fourth of
the total bandwidth out of the top of the UT64x8. This gives us 4 bundles of 144 wires for
each of the 3 logical directions.

The bandwidth into the bottom of the UT64x2 unit tree is distributed as 64 pairs of
channels. This means there are 2 x 2 x 9 = 36 wires in each logical direction. Again, the
bandwidth out of the top of the unit tree is one logical unit. This top bandwidth needs to
be divided into bundles the same size as those out of the bottom of the UT64x8 unit trees
to which this will be connecting. This means the top bandwidth should be broken down
into 4 bundles of 144 wires each.

Unit Tree External Bundle Summary

Table 3.6 summarizes the bundles in and out of the unit trees discussed in this section.
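The bundle counts in Table 3.6 follow from the channel widths given in this section. The sketch below recomputes them; the helper names `bottom` and `top` are hypothetical, and the factor of 2 accounts for channels running in both directions, as in the text.

```python
WIRES_PER_CHANNEL = 9
STANDARD_BUNDLE = 2 * 8 * WIRES_PER_CHANNEL   # 144 wires

def bottom(directions, channels):
    """Return (bundles, bundle size, total wires) at the bottom of a unit tree."""
    size = 2 * channels * WIRES_PER_CHANNEL
    return directions, size, directions * size

def top(channels_per_direction, directions=1):
    """Return (bundles, bundle size, total wires) at the top, using 144 wire bundles."""
    per_direction = 2 * channels_per_direction * WIRES_PER_CHANNEL // STANDARD_BUNDLE
    bundles = directions * per_direction
    return bundles, STANDARD_BUNDLE, bundles * STANDARD_BUNDLE

print("UT64x8 ", bottom(64, 8), top(128))
print("UT64x8c", bottom(64, 8), top(32, directions=3))
print("UT64x2 ", bottom(64, 2), top(32))
```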

3.4 Geometry

3.4.1 Basic Properties

From the preceding section, we see that the total number of unit trees needed to con-
struct each level decreases by a factor of four on each successive level toward the tree root.
Looking at the composition of each unit tree, we see a common convergence from 64 units
of a given size at one stage to 16 such units at the next physical stage up the tree. The size
of these units involved in the convergence increases from a single stack at the first stage by
a factor of 16 for each successive stage.

Example: Looking back at the 768K processor example of Section 3.3.3, the terminal level
is composed of 4096 bidelta stacks, the bottom tree level of 1024 UT64x8 stacks, and the top
tree level of 256 UT64x8 stacks. The convergence at the first level is from sets of 64 bidelta
stacks to a set of 16 UT64x8 stacks. At the next level, these sets of 16 UT64x8 stacks form the
basic logical unit. Sixty-four of these units, which is 64 x 16 = 1024 UT64x8 stacks, converge to
16 of these sets of 16 UT64x8 stacks at the root level. These 16 x 16 = 256 UT64x8 stacks
then form the root structure for the fat-tree.

3.4.2 Growth

There is no recursively repeatable, three-dimensional structure that keeps inter-stage
interconnection distances constant. This can easily be seen by noting that:

• The number of components needing interconnect grows exponentially.

• Keeping inter-stage distances constant, the three-dimensional space of candidate lo-
cations for components is bounded by cubic growth.

Thus, there is no way to prevent inter-stage delays from growing between successive levels
as the system is scaled up in size. At best, we can hope to keep the inter-stage delay growth
down to a reasonable level.

Note, however, that the number of processors supported by the system grows very
rapidly with each level. As such, we need only accommodate a few stages of growth in order
to build the networks of interesting size for the near future.

3.4.3 Hollow Cubes

A natural approach to accommodating this 4:1 convergence in a world limited to three
dimensions is to build hollow cubes. If we select one side as the "top" of the cube, the
four sides can accommodate four times the surface area of the top, and naturally four times
the number of routing stacks of a given size. As such, the "sides" contain the converging
stacks from one level, and the "top" contains the set of stacks at the next level up the tree
to which the "side" stacks are converging. The remaining side will remain open, free of
stacks. This last side could be used to shorten wires slightly further, but utilizing it in this
manner would decrease accessibility (see Section 3.4.5 for further discussion of this issue).

To start, let us consider the first level of convergence of the fat-tree structure. We want
to connect 64 stacks of a given size to some number of others. Particularly, we wish to
connect 64 leaf nodes to $\frac{N_{leaf}}{12}$ UT64x8's, or 64 UT64x2's to 4 UT64x8's. Each side is made
of 16 stacks of the leaf size. Each side is constructed by laying out the 16 constituent stacks
into a 4 x 4 array. The top will be of comparable size depending on the relative size of
the stacks at each fat-tree level. Figure 3.1 shows a hollow cube configuration in which
the stacks at each stage are of equivalent size (e.g. this will be the case when the lower
level is bidelta leaf clusters and the top is UT64x8 unit trees). Figure 3.2 shows a hollow cube
configuration where the top level stacks are four times the size of the lower level (e.g. this
would be the case on the first level of a full fat-tree network in which the top level was
constructed from UT64x8 unit trees and the bottommost stage was composed of UT64x2 unit
trees).

Figure 3.1: First Level Hollow Cube Geometry

Figure 3.2: Hollow Cube with Top and Side Stacks of Different Sizes

At the next stage of convergence, we treat the unit we just created as the base unit.
These can be arranged such that the tops of these hollow cube units are treated as the
basic unit structure for this next stage. Then we arrange a plane of 4 x 4 of these for each
side. A plane of unit trees of size equal to each of the sides is then used for the "top". This
next stage is shown in Figure 3.3 as the natural extension of Figure 3.1. This progression
can be continued in this manner ad infinitum.

Figure 3.3: Second Level Hollow Cube Geometry

3.4.4 Convergence Size

Beyond the first level, the size progression is fairly straightforward. The side size will
grow by 4 at each successive stage since the size growth of 16 is accommodated in two
dimensions.

At the first level, the size of convergence depends on the size of the leaf stacks used
compared to the UT64x8's to which they connect. Fewer UT64x8's will be required the smaller
the leaf stack size. In general, 64 leaf stacks connect to $\frac{N_{leaf}}{12}$ UT64x8 stacks. From this we
can determine the side size of the initial convergence square, or "top", and hence the side
length for the hollow cube. The side size for the cube will be the greater of the side size of
the "top" and the side size of the "sides". Let $l_{unit}$ be the length of the side of each UT64x8 stack
and $l_{leaf}$ be the side length for a leaf stack; these lengths will be as given by Equations 3.11
and 3.12, respectively, where $s_{chip}$ is the side length of a routing component. The length of
a cube side, $L$, is then given in Equation 3.13.

$$l_{unit} = 2 \cdot 8 \cdot s_{chip} \qquad (3.11)$$

$$l_{leaf} = 2 \cdot \left\lceil \sqrt{\frac{N_{leaf}}{3}} \right\rceil \cdot s_{chip} \qquad (3.12)$$

$$L = \max \left( \left\lceil \sqrt{\frac{N_{leaf}}{12}} \right\rceil \cdot l_{unit}, \; 4 \cdot l_{leaf} \right) \qquad (3.13)$$

As described in Section 3.1, $N_{leaf}$ will always be a multiple of three. For almost all sizes
of interest, $N_{leaf}$ will also be a multiple of four. As such, the component arrangements
will always be a square number and the ceiling functions can be dropped. This reduces
the two arguments of the max function to the same expression. So Equation 3.13 reduces to
Equation 3.14.

$$L = 8 \, s_{chip} \sqrt{\frac{N_{leaf}}{3}} \qquad (3.14)$$



The fact that the two lengths reduced to essentially the same expression was predictable
since the bandwidth out of the top of a stack is one-quarter the bandwidth into its bottom.
The four side bandwidths, which are all from the top of stacks, match the bandwidth of the
bottom of the stacks on the "top". The surface area is proportional to bandwidth in this
configuration.
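To make the scale concrete, the sketch below evaluates the cube side length for a given leaf cluster size. It rests on several assumptions that should be checked against the original equations: a UT64x8 stack is taken to be 8 components wide with equal spacing between components (side $2 \cdot 8 \cdot s_{chip}$), the "top" holds $N_{leaf}/12$ such stacks in a square, the closed form is read as $L = 8 \, s_{chip} \sqrt{N_{leaf}/3}$, and the 1.4 inch component size is used purely for illustration. All function names are hypothetical.

```python
import math

def top_side(n_leaf, s_chip):
    """Side of the "top" square built from N_leaf/12 unit tree stacks."""
    l_unit = 2 * 8 * s_chip                      # 8 components plus equal spacing
    stacks_per_side = math.ceil(math.sqrt(n_leaf / 12))
    return stacks_per_side * l_unit

def closed_form_side(n_leaf, s_chip):
    """Cube side with the ceiling functions dropped (assumed form of Equation 3.14)."""
    return 8 * s_chip * math.sqrt(n_leaf / 3)

def side_at_level(n_leaf, s_chip, level):
    """Cube side after `level` stages of convergence; the side grows by 4 per stage."""
    return closed_form_side(n_leaf, s_chip) * 4 ** (level - 1)

for level in (1, 2):
    inches = side_at_level(192, 1.4, level)
    print(level, round(inches / 12, 1), "feet")
```

Under these assumptions a 192 processor leaf cluster gives a first-level cube side of roughly seven and a half feet, growing to roughly thirty feet at the second level.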

3.4.5 Features

While this hollow cube structure may not be the most compact structure, it does exhibit
a number of nice properties.

It exposes the entire surfaces which are interconnected to one another. Since the band-
width of the interconnection is largely surface area limited, this allows maximum exposure
of the areas that need to be connected.

Since the structure is "hollow", the connections can be wired through the free-space in
the center. Wiring through free-space in this manner, the maximum wire length is only $\sqrt{3}$
times the length of a side. This maximum length increases by roughly a factor of 4 every
time we increase the number of processors by a factor of 64. This factor of four increase in
size is due to the factor of four growth in side size for each successive level of the hollow
cube.[1]

[1] As this continues, it will also be necessary to take the size of the "sides" of each square into account,
making the increase somewhat more than a factor of four; for the sizes of present interest, this is not a real
issue.

This structure is highly replicable. The progression described in Section 3.4.3 can be
continued arbitrarily. The growth in wire lengths may, however, prove to be undesirable
for very large structures.

In this form, the individual stacks are reasonably accessible for repair. Since the cubes
are hollow, it is possible to get at any individual stack without moving any other stacks.
This should allow repair and inspection without interfering with the bulk of the network
operation. There may be some difficulty with accessibility due to the mass of wire inside
the cube, but perhaps this can be minimized (see Section 3.5.3). The "missing" sixth wall
of the cube can be used to allow entrance into the center of the structure for maintenance
and repair.

3.4.6 Optimality

This solution seems practical while giving reasonable performance for the sizes of interest
for the next decade. It is not known to be optimal for minimizing the inter-stage wiring
distances. Finding an optimal solution that retains adequate accessibility to be of practical
interest is still an open issue.

3.5 Hollow Cube Construction

3.5.1 Structure Size

Recall from Sections 3.3.1 and 3.3.4 that the larger unit tree stack is roughly two feet square
and one and a half inches tall, while the smaller is roughly one foot square. Arranging these,
or similarly sized bidelta leaf clusters, in 4 × 4 grids and building cubes as just described
in Section 3.4 creates structures with sides between 4 and 8 feet long. Following this
progression one step further, we see that the side lengths grow to between 16 and 32 feet.

Clearly, networks of this size, in current technology, are room and building sized entities,
not something to put on your desktop.

3.5.2 Structure

To achieve the hollow cube structure of Section 3.4, it is necessary to construct "rooms"
for networking. The "walls" of these "rooms" will be tiled with stacks in the 4 × 4 arrangement
described, as will be the "ceiling". The stacks which tile the walls and ceiling will
be arranged such that the tops of the wall stacks face into the room and the bottoms of
the ceiling stacks face into the room. The wall and ceiling thickness will be relatively small
since the stacks are thin. Current projections are for the stacks to be about 1.5 inches thick. In
some cases, it will be appropriate for the "ceiling" structure to be something other than
the top or actual ceiling of the room; a review of Figure 3.3 will make this point clear. This
should pose no additional problems.

To hold these stacks in place, the wall will be a structural grid. It will be similar to a
raised floor where the stacks are analogous to tiles and the wall structure is analogous to
the floor grid which supports the tiles. This grid provides sites for each of the stacks. It
will then be possible to slide the stacks in and out of their sites, and "lock" them into place.
Due to the size and nature of this structure, the wall will also have to provide structural
support. Adding an additional "real" wall to support the structure would prevent close
packing and require the structure to be much larger.

The grid structure supporting the stacks will also need conduits to allow the fluorinert
which cools the stacks to flow to the stacks. Similarly, conduits for power supply connections
will also be needed. The actual fluorinert pumps and power supplies can be placed in
adjacent rooms or on the sixth side of the cube.

3.5.3 Wiring

Interconnection signal wiring will be routed through the free-space in the center of the
room.

Wire Frames

Instead of connecting the wire bundles directly to each stack, these will be wired to a
wiring frame. This wiring frame is a unit attached to the structural grid of the wall and
will serve as a hatch-door on each routing side of a stack. All the wires to a stack are
connected to the wiring frame. When the wiring frame is closed and locked into position,
it will be compressed directly against the routing stack. The compression makes electrical
contact between the stack and the connecting wires, establishing signal flow. The wires are
wired to the wiring frames instead of the stack to facilitate repair and ease of access to
each stack. This scheme allows a stack to be changed without the need to disconnect and
reconnect wires from 80 or more different sources. With the wiring frame, the frame can
simply be opened to replace a stack.

The wire frame is "locked" into place when closed. This allows fluorinert and power to
flow into the stack. When a frame is "unlocked" to remove a stack, several things should
happen. The power to the stack should be cut. The fluorinert flow to the stack should also
cease, and the fluorinert in the stack should be drained. Additionally, it may be necessary
to terminate the transmission lines in some well-behaved manner for the sake of the rest of
the network. When the frame is "re-locked", the fluorinert flow will need to be resumed.
Once the system has had time to refill with fluorinert and get adequate power, it should
go through a reset sequence so that it comes online in a consistent manner.

Optical Connections

While the hollow cube provides adequate space for the wiring required, the wiring within
the cube will still be quite formidable. Keeping track of the large quantity of wires will be
a serious task, especially when problems occur with wiring conductivity.

An alternative that may be technically viable by the time one of these networks is
actually constructed would be to use optical interconnections to provide the wiring through
the cube. The cube is basically free-space, so each light beam need only be aimed at
its appropriate receiver in order to effect interconnect. Since light beams will not interfere
with each other, there will be no need to worry about the three-dimensional wiring problem
of avoiding interference in the center of the cube. [Bergman 86] and [Wu 87] discuss early
work to provide large scale optical interconnect for VLSI systems. Of particular interest is
their use of a holographic optical element to direct optical beams for interconnections and
the potential for adaptive and dynamic connection reconfiguration.

With optical interconnect, signal propagation time across the connection is less sensitive
to wiring materials. Long electrical connections will have delay proportional to both wire
length and the permittivity of the material, ε_r, as given by Equation 3.15, where c is the
speed of light.

Long optical interconnection, in contrast, comes closer to the fundamental limit posed by
the speed of light as given by Equation 3.16 [Kiamilev 89].

$$t_{\text{optical interconnect}} = \frac{l}{c} \qquad (3.16)$$
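To make the comparison between electrical and optical delay concrete, the following minimal sketch computes both. It assumes the standard transmission-line relation, delay = length × sqrt(ε_r) / c, for the electrical case; that relation is an assumption made here and is not transcribed from Equation 3.15. The function names and the use of SI units are likewise illustrative.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0  # c in vacuum


def electrical_delay_s(length_m: float, relative_permittivity: float) -> float:
    """Delay of an electrical line, assuming delay = l * sqrt(eps_r) / c."""
    return length_m * math.sqrt(relative_permittivity) / SPEED_OF_LIGHT_M_PER_S


def optical_delay_s(length_m: float) -> float:
    """Delay of a free-space optical path, assuming delay = l / c."""
    return length_m / SPEED_OF_LIGHT_M_PER_S
```

Under these assumptions, a dielectric with a relative permittivity of 4 doubles the delay of a connection relative to a free-space optical path of the same length.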

Without wires occupying volume in the center of the cube, the whole interior of the
cube would be much more accessible for repair and maintenance.

One potential problem that would arise with optical interconnect is keeping the millions
of laser connections in proper alignment. This could be a very hard task, and alignment
might prove very sensitive to repair operations within the cube. Adaptive alignment
schemes would virtually be a necessity to make the fine alignment of a system of
this size tractable. Adaptive alignment would allow the system to adjust itself into
proper configuration. Perhaps a mature version of the programmable optical interconnect
of [Kiamilev 89] would provide a potential candidate for such adaptive alignment.

If utilized, the optical interconnect would, of course, be integrated as part of the wiring
harness.

3.5.4 Maintenance

One very important issue for the hollow cube is that maintenance is possible, and that
it is possible while the machine is in operation. With a system of this size, and necessarily
expense, extensive down time will be expensive. Additionally, with a system this large, the
number of failures per unit time is necessarily proportionally larger than in small systems.
As a first line of defense against these problems the system is designed to be fault tolerant.
When faults do occur, however, it will be necessary to fix them before they accumulate. It
is very desirable to be able to perform such repairs without completely disabling the machine.

The "unused" sixth side of the cube provides access into the interior of the cube. On
this face a door or hatch can be placed to allow access into the internal structure. With the
individual stacks laid out contiguously along the walls and ceiling, each is readily accessible
without any need to displace any other stacks.

A single stack can be removed and replaced relatively quickly. With the wiring attaching
to the wire frame, the whole operation of replacing a faulty stack can be moderately short.
The faulty stack needs first to be located. Once located, its wiring-frame can be "unlocked".
The stack can then be replaced and the wiring-frame "relocked", allowing the network to
return to normal operation. The stack which has just been removed can then be taken
elsewhere so that its faults can be identified and repaired.

Since there are redundant paths through the network using different intervening stacks
for routing, it will be possible to remove an entire stack from the network while the rest
of the network remains in operation. The practical effect of this will be that some fraction
of the bandwidth will be "missing", or rather appear faulty; this is how it would look
anyway if much of the routing stack were faulty. The network should be wired so as to
guarantee that each output from one level of the network can route through several different
stacks at the next level. With this property and proper termination of the connections to
the removed stack, the network will simply fail to route any connections which attempt
to traverse the stack being replaced. The originating processor will then retry the failed
connections later and either utilize another path through the network, or the same one after
the replacement stack is powered up.

3.5.5 Technology Scaling

These considerations, and the sizes assumed throughout this work, are based mostly on
current, usable technology. Certainly, as interconnect technology improves and packaging
sizes diminish further, this structure can similarly decrease in size. This decrease in size will
translate linearly and directly into improvements in the interconnect speed between components
and stages.

One technology that will perhaps be feasible by the time systems of this size become of
real practical interest is the kind of packaging used on the Cray III [Cray 89]. Utilizing this
kind of technology would allow the size of the component structures, and thus the resulting
structure, to be diminished by about a factor of four to six. Similarly, prospects of wafer
scale integration offer roughly the same size scaling advantages.

3.6 Long Wires

It should be clear from Section 3.5.1 that long wires will be required for interconnection
between unit tree stacks. Here long wires are any wires whose length must be longer than
the longest wire within a unit tree stack.

3.6.1 Strategy

The clock cycle on the unit tree and bidelta stacks will be optimized to be as short as
possible given the operational speed of the routing component and the length of the longest
wire in the unit tree. This means it will necessarily take multiple clock cycles for data to
traverse these long wires between unit tree stacks. It is possible to place multiple data bits
on a set of wires simultaneously, but we must be careful that the data is kept in proper
phase with the clock in order to assure proper behaviour of the routing chip. The proper
phase can be assured in either of a couple of ways. The interconnection wire lengths can
be carefully chosen such that their delays are always integer multiples of the length of the unit
tree clock cycle. In this manner, the phase is preserved by guaranteeing the delay through
the wire interconnection is sufficiently well behaved. Alternately, tapped delay line buffers
can be used to insure that the data is presented in the proper phase relation with the
unit tree clock. A scheme similar to [Rettberg 87] can be used for this purpose. If optical
interconnect is used (Section 3.5.3), using wires to match interconnect delays to the phase
of the data will not be possible. In such a case it would be necessary to use the tapped
delay line alternative.
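The phase-matching requirement can be illustrated with a short sketch. It computes how many whole clock cycles a long wire occupies and how much extra delay a tapped delay line would have to add so that the total is an integer multiple of the clock period. The function names and the choice of integer picoseconds as the unit are assumptions made for illustration; they do not come from the RN1 design.

```python
def delay_cycles(wire_delay_ps: int, clock_period_ps: int) -> int:
    """Smallest whole number of clock cycles that covers the wire delay."""
    if wire_delay_ps < 0 or clock_period_ps <= 0:
        raise ValueError("delay must be >= 0 and clock period must be > 0")
    # Ceiling division on integers avoids floating-point rounding surprises.
    return -(-wire_delay_ps // clock_period_ps)


def padding_ps(wire_delay_ps: int, clock_period_ps: int) -> int:
    """Extra delay needed to make the total an integer multiple of the clock."""
    cycles = delay_cycles(wire_delay_ps, clock_period_ps)
    return cycles * clock_period_ps - wire_delay_ps
```

For example, with a 10 ns clock a wire with 23 ns of delay occupies three cycles and needs 7 ns of padding, while a wire cut to exactly 30 ns of delay needs none.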

This scheme will necessarily increase the latency of a connection, but the increased
latency is inevitable given the geometric constraints of routing. This scheme does manage
to minimize the effects of scaling on latency by only slowing down those interconnection
stages which must be long.

3.6.2 Requirements

The only additional requirement this scheme poses is on RN1, the routing component.
In particular, it only requires different behaviour from RN1 when a connection is turned
around (see Section 1.3). Currently, RN1 expects to get valid data in the new direction
of flow two clock cycles following the reception of a byte indicating it should turn the
connection around. The addition of a single clock cycle's worth of wire delay causes this
turn around time to be increased by two clock cycles; that is, one additional clock cycle
is required for the turn byte to propagate across the interconnect to the next routing
component, and one additional clock cycle is required for the return data to propagate
back across the connection. With no extra clock cycles of wire delay between RN1 routing
components, this turn will occur in two clock cycles. With k clock cycles of wire delay
between a pair of successive routing stages, the turn will occur in 2(k + 1) clock cycles.
In order for the turn sequence to function properly, the routing component at each end of
such a long wire must be capable of dealing with the extra cycles. RN1 will need to know
the number of delay cycles to expect and be able to deal with them accordingly.
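The turn-around relation stated above can be written down directly. The sketch below is only a restatement of the 2(k + 1) rule from the text; the function name is hypothetical.

```python
def turn_cycles(wire_delay_cycles: int) -> int:
    """Clock cycles for a connection turn across a wire with k cycles of delay."""
    if wire_delay_cycles < 0:
        raise ValueError("wire delay cannot be negative")
    # Two cycles with no wire delay, plus one extra cycle in each direction
    # for every cycle of wire delay.
    return 2 * (wire_delay_cycles + 1)
```

With no wire delay this gives the two-cycle turn that RN1 currently expects, and each added cycle of wire delay adds two cycles to the turn.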

The delay size will need to be configurable for every input and output port on the
routing component. With 8 input ports and 8 output ports, this makes for a total of
16 ports that need to be configured on a single RN1 component. It is necessary to be
able to configure each input and output port separately, as established in Section 2.3, to
allow input and output ports from the same component to connect to interconnections of
differing delay lengths. Additionally, to allow for the large range of delay values that must
be accommodated for reasonably large networks,2 the delay length configuration will require
multiple bits to specify the delay of a single input or output port. While this configuration
information could be provided to the chip by configuration pins, it should be obvious that
this would require the addition of quite a large number of signal pins. RN1 already has
tight constraints on the number of pins that its package will support. Thus, an alternate
means must be used to configure an RN1 routing component with this information.
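A rough count shows why dedicated configuration pins are unattractive. The sketch below assumes each port's delay is stored as a plain binary number, which is an illustrative encoding and not something the text specifies; only the figure of 16 ports per component comes from the text.

```python
PORTS_PER_COMPONENT = 16  # 8 input ports + 8 output ports, as stated in the text


def bits_per_port(max_delay_cycles: int) -> int:
    """Bits needed to encode a delay in 0..max_delay_cycles (binary, assumed)."""
    if max_delay_cycles < 0:
        raise ValueError("maximum delay cannot be negative")
    return max(1, max_delay_cycles.bit_length())


def config_bits_per_component(max_delay_cycles: int) -> int:
    """Total configuration bits for one routing component."""
    return PORTS_PER_COMPONENT * bits_per_port(max_delay_cycles)
```

Under this assumed encoding, supporting delays of up to 7 cycles already requires 48 bits of configuration per component, far more than could reasonably be supplied on package pins.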

One such alternative is to use UV programmable cells in the routing chip to store this
configuration data. This would require the addition of only a few signal pins to facilitate
the programming of the UV configuration cells. Cells would be programmed by initiating
a program sequence and shifting the configuration data into the UV cells while the
component is exposed to UV light. [Glasser 85] describes a technique for constructing UV
programmable cells of this nature which is applicable to the one micron CMOS process in
which RN1 is fabricated. This scheme has the drawback that a component cannot simply
be replaced with another one off the shelf. The replacement component will first need to
be configured before it can serve as a replacement.

A simple programming board can easily be built to program the configuration into a
component as necessary.

2 See Section 4.1 for projected wire and delay cycle lengths in systems of various sizes.

With the additional configuration information provided to RN1, it must also be updated
to deal appropriately with the configured delays of various lengths. This will require
updating the finite state machine logic in the routing component to deal with varying
delay lengths. This will be a minor change, but will necessitate adding additional state
information to the FSMs. During the additional delay cycles, some reasonable data or
pattern should be sent to the component not directly connected to the long wire so that
the system is guaranteed to be well behaved. After the status and checksum bytes are sent
to this component, the component on what was originally the sending side of the long wires
will need to send this additional data. This additional data could simply be a repetition of
the status and checksum bytes.

3.7 Processors

Processing elements can be attached in a straightforward fashion to this network configured
in the geometry described. Conceptually, each network terminal point will consist of a
processor, memory, and a cache-controller (See Figure 1.1). A given number of processors
will be associated with each leaf stack, whether a bidelta leaf cluster or a unit tree.
As previously noted, each leaf stack is rather thin. The processors and their associated
components for a given leaf can also be arranged in a stack structure. The sides of this
stack structure will be made the same size as the relevant routing stack. The processor
stack can then be layered until it accommodates all the necessary components. Since the
number of processors associated with a leaf routing stack is generally about the same as
the number of routing components in the leaf stack, the processor stack structure will be
of a thickness which is the same order of magnitude as that of the leaf routing stack. This
processor stack can then be abutted directly to the leaf stack or even directly connected,
making it part of the same physical stack. Since the stack structure tends to keep vertical
distances short, this allows the processors to be within reasonable proximity of the routing
network. Also, since the stacks are thin, this will accommodate space for the processors by
only making the "walls" of the hollow cube (Sections 3.4 and 3.5) slightly thicker. This
additional thickness should not be significant enough to change the overall size of the hollow
cube structure appreciably.

3.8 Routing Computation

The arrangement of components and structures described so far essentially determines
the composition of a routing sequence for this network. In this section, this routing is
summarized, including a scheme for the proper generation of the routing sequence.
3.8.1 Distinguishing a Processor

Each leaf processor needs to have a unique specification so that it can be referenced.
In a fat-tree structure, the logical specification for a processor is its "address" or location
relative to the root of the fat-tree. In the case of a full fat-tree, this will simply be the path
down the fat-tree to the processor. In a hybrid fat-tree with bidelta clusters as leaves of
the fat-tree network rather than individual processors, this address will be the path from
the root and the location of the processor in the leaf cluster. Thus, a processor is specified
as L = C ∘ P, where C is the path from the fat-tree root to the leaf, and P is the location
of the processor within the bidelta cluster. Recall from Section 3.1.1 that routing out of
the bidelta leaf cluster stack is accomplished when the high two bits of the initial routing
byte are 1's; as such, the high two bits of a processor specification will never be 11. It is
easiest for routing if C is exactly the routing sequence necessary to get from the root to
the desired leaf. Recall from Section 3.3.5 that the low two bits of each byte in a down
routing sequence through unit trees are unused since only six bits of routing are performed on
the down routing path through each unit tree. The value of these unused bits is irrelevant,
but for clarity, they should probably be considered zero (0).

3.8.2 Routing

The routing sequence will be the series of bytes necessary to open a connection to a
desired destination. In general this consists of three parts: the up routing sequence, U, the
down routing sequence, D, and the routing within the bidelta cluster, B. Of course, full
fat-trees will not have the B component. Each of the components of the routing sequence
will occupy an integral number of bytes; a component will be padded to byte length when
necessary. This keeps the portions of the routing sequence conceptually separated and is
consistent with the previous specifications for the placement of swallows (Sections 3.3.5
and 3.1.2). The separation of the components of the routing sequence into their own bytes
allows the routing bytes to be generated without any need to shift bits around within bytes.
A complete routing sequence, S, is simply U ∘ D ∘ B.

3.8.3 Computing the Routing Sequence

With this configuration, we can compute the necessary routing sequence with moderate
ease. Unlike the bidelta case, it is necessary to know the source location in the network
in order to determine an optimal routing sequence; this additional information is necessary
in order to exploit locality in the fat-tree structure. Let L1 = C1 ∘ P1 be the source processor
and L2 = C2 ∘ P2 be the destination processor.3 The computation of the routing sequence
then proceeds as follows:

1. I = C1 ⊕ C2; that is, I is the bit-wise logical exclusive-or of C1 and C2.

2. M = a bit vector with 0's for all leading zero bytes of I and 1's for all bytes from the
leading non-zero byte to the end; that is, M is a vector which marks all significant
bytes.

3. j = location of the highest-order non-zero bit in M (1 based).

4. O =
   10 if bit 6 or 7 is the highest non-zero bit in the high non-zero byte of I
   01 if bit 4 or 5 is the highest non-zero bit in the high non-zero byte of I
   00 if bit 2 or 3 is the highest non-zero bit in the high non-zero byte of I

5. U = (11)^j ∘ (O) ∘ (00)^((3 − j) mod 4); the trailing 0's simply suffice to pad U to an
integral byte quantity.

6. D = low j bytes of C2. This will, in fact, include unnecessary routing information.
However, this quantity needs to be padded to an integral number of bytes in any case.
When the bit-rotations are wired appropriately in the crossover routing paths, this
allows D to be generated with minimal calculations.

7. B = P2

Note that when M evaluates to all zero, it is the case that C1 = C2 and U and D are
unnecessary; in this case, B fully specifies the routing to the final destination since the two
processors are in the same leaf cluster.

3 This degenerates to the specific case of full fat trees when P1 = P2 = ε (the empty string).

Figure 3.4: Byte-wide xor Slice.
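The seven steps can be collected into a short software sketch. This is a minimal illustration rather than the hardware implementation discussed in the text: paths are passed as big-endian byte strings, the function name is hypothetical, and the padding of U is done by appending zero bits up to the next byte boundary. It also assumes, following Section 3.8.1, that the low two bits of every path byte are zero.

```python
def routing_sequence(c1: bytes, c2: bytes, p2: bytes) -> bytes:
    """Return U + D + B for source path c1, destination path c2 and
    destination cluster location p2 (big-endian byte strings)."""
    if len(c1) != len(c2):
        raise ValueError("source and destination paths must have equal length")
    # Step 1: I = C1 xor C2, byte by byte.
    i_bytes = bytes(a ^ b for a, b in zip(c1, c2))
    # Steps 2-3: find the leading non-zero byte; j counts the bytes from it
    # to the end, i.e. the number of 1 bits in the mask M.
    lead = next((k for k, b in enumerate(i_bytes) if b), None)
    if lead is None:
        return bytes(p2)  # M is all zero: same leaf cluster, only B is needed
    j = len(i_bytes) - lead
    # Step 4: O from the highest non-zero bit pair of that byte.
    high = i_bytes[lead]
    if high & 0xC0:
        o = 0b10
    elif high & 0x30:
        o = 0b01
    elif high & 0x0C:
        o = 0b00
    else:
        raise ValueError("low two bits of a path byte are assumed to be zero")
    # Step 5: U = (11)^j, then O, then zero padding to a whole number of bytes.
    bits = "11" * j + format(o, "02b")
    bits += "0" * (-len(bits) % 8)
    u = int(bits, 2).to_bytes(len(bits) // 8, "big")
    # Step 6: D = low j bytes of C2.  Step 7: B = P2.
    return u + c2[-j:] + bytes(p2)
```

As a usage example, `routing_sequence(bytes.fromhex("fc02"), bytes.fromhex("e842"), bytes.fromhex("67"))` corresponds to the values worked through in Section 3.8.5.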

3.8.4 Implementation of Computation

It is enlightening to consider possible structures to efficiently implement the computation
just described. In order to talk about individual bytes and bits, let us adopt the
following notation:

• B<n> refers to the nth bit of B, with the lowest bit being bit zero.

• B_N refers to the Nth byte of B, with the lowest byte being byte zero.

• B_N<n> thus refers to the nth bit of the Nth byte of B.

The intermediate quantity I, being an xor, is designed to be cheap to calculate.

The value M requires only the knowledge of which byte is the first to contain non-zero
bits. This allows all the bits in each byte of I to be or'ed together for comparison.
The result is a small bit vector. Since this quantity is then a small number of bits, it is
possible to let the first non-zero bit stimulate the others so that a vector results with all
the significant bytes marked with a 1 bit. Figure 3.5 shows this computation for a single
bit in M.
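The or-reduce and "stimulate" structure described for M can be mirrored in software. The sketch below is illustrative only (the function name is hypothetical, and the hardware does this combinationally rather than in a loop): each byte of I is reduced to a single flag, and the first flag that is set propagates to all lower-order positions.

```python
def significant_byte_mask(i_bytes: bytes) -> int:
    """Return M as an integer with one bit per byte of I (big-endian input).

    Bit N of the result is 1 when byte N of I, or any higher byte, is non-zero.
    """
    mask = 0
    seen_nonzero = 0
    for byte in i_bytes:                 # most significant byte first
        seen_nonzero |= int(byte != 0)   # OR of the eight bits of this byte
        mask = (mask << 1) | seen_nonzero
    return mask
```

For I = 0x1440 both bytes are significant and the mask is 0b11; for I = 0x0040 only the low byte is significant and the mask is 0b01.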
Figure 3.5: Calculation of a Single Bit of M.

Figure 3.6: Calculation of a Single Bit of O.

Both the computation of I and that of M can easily be done either completely in parallel,
all of C1 and C2 at once, or in a byte serial manner as is appropriate to a particular
implementation.

In both cases the computation of O can be calculated on each byte in I simultaneously
and in parallel with the computation of M. The value of M will determine which computation
of O should be used. The computation of O can proceed similarly to M, only with a
granularity of pairs of bits instead of bytes. In fact it is necessary to look only at the high
two pairs of bits in order to determine O. In this case, instead of stimulating the lower
bit(s) in the bit vector, we inhibit so that O_N<2:1> is the correct value for use in the
construction of U. See Figure 3.6 for an example of the calculation of O.

We can assume for the moment that U will be a single byte quantity.4 U can be
computed from M and O in a couple of gate delays as shown in Figure 3.7.

Since bytes are sent byte serial into the network, no computation is really done for D.
M identifies which byte of C2 to start with when it is time to start sending D into the
network. Similarly, no computation is done for B.

From these simple constructs, we can see that the computation of a routing sequence is
an inexpensive operation and can be done in a reasonably small amount of time. A quick
simulation of this implementation in a 1 micron CMOS process5 generates U and M in
less than 6 ns.



4 This will be the case until more than about 50 million processors are connected in this manner; this
scheme easily generalizes to the multiple byte case.
5 The same process in which RN1 is fabricated.
Figure 3.7: Calculation of U from M and O.



3.8.5 Example

Consider the fat-tree built with bidelta cluster leaves and two levels of unit trees described
in Section 3.3.3. This gives a fat-tree with 786,432 processors; C will be 2 bytes long and
P will be one byte long. Let L1 = 0xFC0234 and L2 = 0xE84267.

1. I = 0xFC02 ⊕ 0xE842 = 0x1440

2. M = 0x03

3. j = 2

4. O_0 = 10; O_1 = 01; since M = 0x03, O_1 is the correct value for O

5. U = (11)^2 ∘ (01) ∘ (00) = 11110100 = 0xF4

6. D = 0xE842

7. B = 0x67

Thus the correct routing sequence is U ∘ D ∘ B = 0xF4E84267. This will open a
connection out of the bidelta stack, through the first stage unit tree, and into level 2 of the
next unit tree. At that point, it has crossed over to the down path so the first byte, U, is
swallowed. It routes down through that stack with the relevant portion of the next byte.
That byte is then swallowed and the next byte, 0x42, is used to route back through the
appropriate unit tree in the first stage. This final D byte is swallowed and the 0x67 byte
is used to route through the bidelta leaf cluster to the final destination.
4. Analysis



Chapters 2 and 3 described the construction of Transit fat-tree networks. This chapter
quantifies some characteristics of the resulting networks. Considering fat-tree networks
with both unit tree leaves and bidelta cluster leaves, a progression of networks from full
bidelta networks to full fat-trees can be compared and analyzed. Section 4.1 describes the
distribution of wires by length in hollow cube configurations. Section 4.2 extends the bidelta
network structure beyond the single stack size for the purpose of comparisons. Section 4.3
summarizes network size freedoms for the various network configurations. It then uses size
as a parameter to quantify hardware requirements and network latencies. Section 4.4 uses
the development of the previous sections to provide some concrete network examples for
comparison. Finally, Section 4.5 uses simple probabilistic models to provide basic routing
statistics for the networks of Section 4.4.

4.1 Wire Lengths

Wire lengths are a significant concern in building large networks. As the system grows,
the number of clock cycles required to route between unit trees becomes the dominant
component of network latency. The hollow cube structure (Section 3.4) keeps wires moderately
short considering the magnitude of the wire convergence that must occur. This section
quantifies the distribution of wires by length in hollow cube geometries.

4.1.1 Worst Case

Utilizing free-space wiring in the hollow cube, the longest wires will be those that
traverse the cube's diagonal. These wires will be √3 times the length of the cube side
long. The length of the cube side depends on the size of convergence for the cube. As
seen in Section 3.4.4, for standard hybrid fat-trees, the sides will be four bidelta leaf stack
side lengths long at the first convergence and increase by a factor of four at each successive
stage. Full fat-trees similarly progress from sides of length equal to four unit tree stack side
lengths at the first convergence and progress by a factor of four at successive stages. The length
of the side of a stack is determined by the maximum number of routing components on a
single routing stage.
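The worst case can be tabulated with a small sketch. It assumes the cube side is the stack side multiplied by four at every convergence stage, as described above, and takes the cube diagonal to be √3 times the side; the function names and the convention that the first convergence is stage 1 are illustrative assumptions.

```python
import math


def cube_side(stack_side: float, convergence_stage: int) -> float:
    """Side length of the hollow cube at a given convergence stage (1-based)."""
    if convergence_stage < 1:
        raise ValueError("the first convergence is stage 1")
    return stack_side * 4 ** convergence_stage


def worst_case_wire(stack_side: float, convergence_stage: int) -> float:
    """Length of a wire running along the full diagonal of the cube."""
    return math.sqrt(3) * cube_side(stack_side, convergence_stage)
```

With stack sides of one and two feet, this reproduces the 4 to 8 foot and 16 to 32 foot cube sides of Section 3.5.1, with worst-case wires of roughly 7 to 14 feet and 28 to 55 feet respectively.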

4.1.2 Distribution of Lengths

Only a small fraction of the total wires within a hollow cube are of the worst case length.
Perhaps a more interesting metric is the distribution of wires by their length.

For simplicity, let us normalize wire lengths to the length of the stack sides. This allows
the derived distributions to be applied generally regardless of stack size. The normalizing
length, l_stack, will be the length of a side of the component stacks at the terminal level of
the fat-tree. N_s will denote the number of stacks along the side of the cube.

Each interconnection within the hollow cube connects from a "side" to the "top" of the
cube (See Figures 3.1 and 3.2). Connection endpoints are evenly distributed across the two
dimensions composing each of the sides of the hollow cube; this results since each stack in the
sides has the same distribution of interconnection to the top. Each of the stacks on the
hollow cube sides will connect to endpoints which are evenly distributed across the surface
of the top of the cube. Using this uniformity, we can easily characterize the wire length
distributions.

We can start by decomposing the distribution into separate dimensional components, x,
y, and z. Since these distributions are independent, we can then form the total distribution
from the product of the dimensional distributions. To express the dimensional distributions,
a couple of simple probability distributions are needed.

Uniform Dimensional Distribution

One case of interest occurs when the distribution is completely uniform across the
possible space of lengths. For such a case, we have a uniform distribution. Thus the
distribution function is simply that of Equation 4.1.

$$p_x(x_0) = \begin{cases} \frac{1}{N_s} & 1 \le x_0 \le N_s \\ 0 & \text{otherwise} \end{cases} \qquad (4.1)$$

Uniform Difference Distribution

The other case of interest occurs when the distribution is that of the difference between
two values picked randomly from the same uniform distribution. In this case, we have a
distribution $p_x$ as in Equation 4.1, and we want the distribution for the quantity $|x_i - x_j|$
where $x_i$ and $x_j$ are described by $p_x$. This distribution is described by Equation 4.2.

$$p_{dx}(x_0) = \begin{cases} \frac{1}{N_s} & x_0 = 0 \\ \frac{2(N_s - x_0)}{N_s^2} & 0 < x_0 < N_s \\ 0 & \text{otherwise} \end{cases} \qquad (4.2)$$



Hollow Cube Distribution

With the simple distributions just described, the total distribution for wire lengths in
a hollow cube can be derived. The distance the wire must traverse in the vertical ($z$)
dimension will be described by $p_x$ since all interconnections connect from the top to some
distance down the side of the cube. Similarly, the distance the wire must traverse "into"
the cube, normal to the surface of the side, will be determined solely by the location of the
destination on the top surface. This dimension, which for the current development will be
called $y$, can also be described by $p_x$. In the remaining dimension, which will be referred
to as $x$, the distance traversed by the wire depends on the relative placement of both the
source and destination of the wire and thus will be described by $p_{dx}$.

N.B. In an absolute frame of reference, this dimension would be the $x$ dimension for two
faces and the $y$ dimension for the two faces adjacent to those.

Figure 4.1: Distribution of Normalized Wire Lengths, $N_s = 4$ and 16, Respectively.

Table 4.1: Expected Wire Lengths Normalized to Stack Size, $l_{stack}$

With this understanding of the dimensional distance distributions, we can describe the
wire length distribution by Equation 4.3.

$$p_l(l_0) = \sum_{z=0}^{\max(l)} \sum_{y=0}^{\max(l)} \sum_{dx=0}^{\max(l)} \left[ p_x(z) \cdot p_x(y) \cdot p_{dx}(dx) \cdot w_l(l_0, dx, y, z) \right] \qquad (4.3)$$

$$\max(l) = \left\lceil \sqrt{3} N_s \right\rceil \qquad (4.4)$$



The function $w_l$ is simply used to determine whether or not to include the probability for
a given $(dx, y, z)$ combination in the sum and is described by Equation 4.5.

$$w_l(l_0, dx, y, z) = \begin{cases} 1 & l_0 = \left\lceil \sqrt{dx^2 + y^2 + z^2} \right\rceil \\ 0 & \text{otherwise} \end{cases} \qquad (4.5)$$



The distribution, $p_l$, is easily computed for a given value of $N_s$ and gives the resulting
lengths in units of $l_{stack}$. Figure 4.1 shows the distribution for $N_s = 4$ and 16, the
first two side lengths for hollow cubes. The expected, or average, wire lengths derived from
this distribution are shown in Table 4.1.
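The triple sum of Equation 4.3 can be evaluated by direct enumeration. The following is a minimal Python sketch, not the procedure used to produce Figure 4.1 or Table 4.1. The function names are hypothetical, and the sketch assumes that the uniform distribution covers the integers 1 through $N_s$, that the difference distribution covers 0 through $N_s - 1$, and that lengths are rounded up to whole stack sides.

```python
from collections import defaultdict
from fractions import Fraction
from math import isqrt


def p_x(x0, n_s):
    """Uniform distribution; the range 1..n_s is an assumption of this sketch."""
    return Fraction(1, n_s) if 1 <= x0 <= n_s else Fraction(0)


def p_dx(x0, n_s):
    """Distribution of |xi - xj| for two independent uniform draws."""
    if x0 == 0:
        return Fraction(1, n_s)
    if 0 < x0 < n_s:
        return Fraction(2 * (n_s - x0), n_s * n_s)
    return Fraction(0)


def ceil_sqrt(v):
    """Smallest integer r with r * r >= v, computed exactly on integers."""
    r = isqrt(v)
    return r if r * r == v else r + 1


def hollow_cube_lengths(n_s):
    """Wire length distribution in units of the stack side length."""
    dist = defaultdict(Fraction)
    for z in range(1, n_s + 1):          # distance down the side
        for y in range(1, n_s + 1):      # distance into the cube
            for dx in range(n_s):        # relative source/destination offset
                length = ceil_sqrt(dx * dx + y * y + z * z)
                dist[length] += p_x(z, n_s) * p_x(y, n_s) * p_dx(dx, n_s)
    return dict(dist)


def expected_length(dist):
    """Average wire length for a distribution returned above."""
    return sum(length * prob for length, prob in dist.items())
```

Exact rational arithmetic is used so that the probabilities sum to exactly one; `float(expected_length(hollow_cube_lengths(4)))` gives the average for the first hollow cube side length under the stated assumptions.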

4.1.3 Distribution by Delay

Once we have the normalized distributions, it is easy to convert these into delay
distributions for a particular leaf stack size. This, of course, is the important performance
metric.

The previous distribution, $p_l$, can be converted into a delay distribution when the clock
cycle and stack size are given. With clock cycle $t_c$ and stack size $l_{stack}$, the number of
delay cycles between stages is related to the wire delay by Equation 4.6.

$$n_{cycles}(l_0) = \left\lceil \frac{t_{delay}(l_0)}{t_c} \right\rceil \qquad (4.6)$$



Using an appropriate model for the propagation time of a signal across a given length of
wire, we can relate the number of delay cycles directly to the normalized wire length, $l_0$.
Equation 4.7 gives this relation assuming the signal is free to propagate at the speed of
light, $c$, as would be the case for optical interconnect.

$$t_{delay}(l_0) = \frac{l_0 \cdot l_{stack}}{c} \qquad (4.7)$$

With these relations, the wire length distribution can be converted into a distribution for
the number of delay cycles between stages, computed as in Equation 4.8.

$$p_n(n_0) = \sum_{l_0=0}^{\max(l)} \left[ p_l(l_0) \cdot w_n(n_0, l_0) \right] \qquad (4.8)$$

Once again a selection function, $w_n(n_0, l_0)$, is used to select the appropriate lengths which
correspond to a given delay.

$$w_n(n_0, l_0) = \begin{cases} 1 & n_0 = n_{cycles}(l_0) \\ 0 & \text{otherwise} \end{cases} \qquad (4.9)$$
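The conversion from wire lengths to delay cycles can be sketched in a few lines of Python. This is an illustration only: the function names are hypothetical, lengths are taken in feet and times in nanoseconds, and rounding the delay up to a whole number of cycles is an assumption of the sketch.

```python
from math import ceil

# Speed of light, about 0.2998 m/ns, expressed in feet per nanosecond.
C_FEET_PER_NS = 0.983571


def delay_cycles(l0, l_stack_ft, t_c_ns):
    """Delay cycles for a wire of normalized length l0.

    The wire delay is l0 * l_stack / c; rounding up to whole clock
    cycles is an assumption of this sketch.
    """
    t_delay_ns = l0 * l_stack_ft / C_FEET_PER_NS
    return ceil(t_delay_ns / t_c_ns)


def delay_distribution(length_dist, l_stack_ft, t_c_ns):
    """Fold a {length: probability} map into a {cycles: probability} map."""
    cycles = {}
    for l0, prob in length_dist.items():
        n0 = delay_cycles(l0, l_stack_ft, t_c_ns)
        cycles[n0] = cycles.get(n0, 0) + prob
    return cycles
```

For example, with two-foot stacks and a 10 ns clock, a wire four stack sides long fits in one delay cycle while a wire five stack sides long needs two.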

Current projections make $t_c = 10$ns (Section 1.3). Recall that for [...], $l_{stack} \approx 1'$
(Section 3.3.4). Equation 3.12 gives a rough approximation of leaf stack size for bidelta
clusters. Using these values, several distributions for a hybrid and full fat-tree hollow cube
structure are shown in the following figures. Figure 4.2 is the delay cycle distribution in
a hybrid fat-tree hollow cube using [...] leaves. Figure 4.3 describes the distribution for
both hybrid fat-trees using [...] leaves and full fat-trees using [...] leaves. Figure 4.4
corresponds to the distribution for a hybrid fat-tree hollow cube with [...] leaves. These
figures all parallel Figure 4.1, but the distributions here are given in terms of cycles of delay.
Table 4.2 gives the expected number of delay cycles for these configurations.

4.2 Large Bidelta Networks

For comparison, it is necessary to consider bidelta networks scaled to sizes comparable to
the fat-tree networks under consideration. However, since we cannot simply build arbitrarily
large stacks (Section 1.4), large bidelta networks must be decomposed into multiple stack
structures.

Figure 4.2: Distribution of Delay Cycles, $N_s = 4$ and 16, Respectively, for [...] Leaves.

Figure 4.3: Distribution of Delay Cycles, $N_s = 4$ and 16, Respectively, for [...] Leaves.

Figure 4.4: Distribution of Delay Cycles, $N_s = 4$ and 16, Respectively, for [...] Leaves.

Table 4.2: Expected Number of Delay Cycles

4.2.1 Bidelta Stacks

Perhaps the ideal size for the constituent stacks is four routing stages. With four
stages of routing, the stack distinguishes 256 logical destinations. Thus each stack performs
a byte's worth of routing; a fresh routing byte can be used to route through each stack.

The final size of such a configuration will then depend on the number of outputs provided
in each of the logical directions. The smallest configuration would provide two outputs per
logical direction. In general, such a configuration would only be desirable as the final
routing stack composing a large bidelta network. In order to provide proper fault tolerance
with two outputs in each logical direction, the final stage of the network must utilize
the alternate RN1 configuration where each crossbar provides a single output per logical
direction (Section 1.5). Since this routing stage's performance is inferior to that of the
other stages, in order to achieve optimal routing performance only the final routing stack
should concentrate the number of outputs in each direction to two.

Being limited to at most 64 components in each routing stage (Section 3.3.1), we are not
free to consider bidelta stacks that both distinguish 256 different destinations and provide
more than two outputs in each logical direction. In general, if a stack distinguishes $n$ logical
destinations and has $l$ outputs in each logical destination, $N_c$ routing components will make
up each routing stage as given by Equation 4.10. This relation follows trivially from the
fact that each RN1 component has eight outputs.

$$N_c = \frac{n \cdot l}{8} \qquad (4.10)$$

As such, we are only free to consider stacks in which $n \cdot l \le 512$. For clarity, bidelta stacks
with parameters $n$ and $l$ will be referenced as $B_{n,l}$. From this discussion, we can conclude
that $B_{256,2}$, $B_{64,2}$, and $B_{16,2}$ stacks should be used only as the final stage of stacks in
large bidelta networks. The largest stack reasonable for use in forming the earlier network
stages is [...]



4.2.2 Arranging Bidelta Stacks

The bidelta stacks are then treated as the standard routing units. They are arranged
into stages in order to build larger bidelta networks. Within the bidelta stacks, the clock
cycle can be as fast as a single stack constrains it to be. Additional stage delays will be
incurred between stack stages, as is the case with fat-trees, because the wires will be long.
However, this extra delay is only incurred between stack stages and at the final recycle
path. The long wires can be dealt with as described in Section 3.6. This configuration will
give better performance than uniformly slowing the clock rate down everywhere in order
to accommodate the propagation delay across the long wire paths.

For replication, it will be necessary to place swallows between the stages of routing
stacks. In this manner, a separate byte is used to route through each bidelta stack.

Since the number of inter-stage delays does not affect the total clock rate of the system,
it makes sense to inter-wire the stacks in an indirect binary cube network style. This
allows the wires at the first few stages to be relatively short. Successively longer wires are
used between successive stages of routing stacks. This structure can be laid out in two
dimensions so that the critical geometric parameter at each stage is the square root of the
number of stacks involved in the convergence; that is, the subsets of the routing plane that
will be interconnecting between successive levels will be squares of stacks. At the final
stage, the convergence will be one big square; at earlier stages, there will be many smaller
squares of convergence. Figure 4.5 shows the interconnection topology of an indirect binary
cube using binary switching components. Using bidelta stacks for switching, each stack will
switch in 16, 64, or 256 directions, and the inter-stage wiring will occur in two dimensions
rather than one.

Figure 4.5: Indirect Binary Cube Topology

4.2.3 Wire Lengths

We can apply the analysis of Section 4.1 to analyze the distribution of wires by length
in this scheme as well. Here we have a sequence of routing planes with the interconnect
between planes. In this configuration, the wires are distributed with various displacements
in the $x$ and $y$ directions and an essentially constant displacement between planes. Assuming
the inter-plane displacement is negligible, we can get a feel for the distribution of wire lengths.

Here the distributions of displacements necessary in the $x$ and $y$ directions are, to a
first-order approximation, uniform difference distributions as described in Section 4.1.2.
From this, we can easily derive the distribution of wire lengths. $N_s$ is the length of the
side of the square of convergence at the stage of interest, again normalized to stack
size. $p_l$ will again be our distribution function; $w_l$ will be the selection function as before.
Equations 4.11 through 4.13 summarize these relations.

$$p_l(l_0) = \sum_{dy=0}^{\max(l)} \sum_{dx=0}^{\max(l)} \left[ p_{dx}(dx) \cdot p_{dx}(dy) \cdot w_l(l_0, dx, dy) \right] \qquad (4.11)$$

$$\max(l) = \left\lceil \sqrt{2} N_s \right\rceil \qquad (4.12)$$

$$w_l(l_0, dx, dy) = \begin{cases} 1 & l_0 = \left\lceil \sqrt{dx^2 + dy^2} \right\rceil \\ 0 & \text{otherwise} \end{cases} \qquad (4.13)$$

Figure 4.6: Distribution of Normalized Wire Lengths, $N_s = 8$ and 64, Respectively.

Figure 4.7: Distribution of Delay Cycles, $N_s = 8$ and 64, Respectively, using [...] Stacks.

Figure 4.6 shows the distributions for normalized side lengths of 8 and 64. If a large bidelta
network were built with three stack stages, two built from [...] and the last from [...],
$N_s$ would be 8 between the first and second stage and 64 between the second and third.
Normalized average wire lengths are 4.6 and 33.9 for $N_s = 8$ and 64, respectively.
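The planar case is small enough to evaluate exactly. The sketch below is a hypothetical illustration rather than the original computation: it assumes the displacement in each direction is the difference of two positions drawn uniformly from $N_s$ stack positions, and that each wire length is rounded up to a whole number of stack sides. Under those assumptions the $N_s = 8$ average works out to 18828/4096, or about 4.6, in agreement with the figure quoted above.

```python
from fractions import Fraction
from math import isqrt


def p_dx(d, n_s):
    """Probability that two uniform draws from n_s positions differ by d."""
    if d == 0:
        return Fraction(1, n_s)
    if 0 < d < n_s:
        return Fraction(2 * (n_s - d), n_s * n_s)
    return Fraction(0)


def ceil_sqrt(v):
    """Smallest integer r with r * r >= v."""
    r = isqrt(v)
    return r if r * r == v else r + 1


def planar_expected_length(n_s):
    """Average normalized wire length over a square of n_s by n_s stacks."""
    total = Fraction(0)
    for dx in range(n_s):
        for dy in range(n_s):
            weight = p_dx(dx, n_s) * p_dx(dy, n_s)
            total += weight * ceil_sqrt(dx * dx + dy * dy)
    return total
```

Calling `float(planar_expected_length(8))` returns approximately 4.597.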

4.2.4 Delay Cycles

The number of delay cycles required can be determined exactly as described in Section
4.1.3. Making the same assumptions as in Section 4.1.3, Figure 4.7 parallels Figure 4.6
in terms of delay cycle units. The side length for both [...] and $B_{256,2}$ stacks is two
feet. For this configuration, the average number of delay cycles is [...] when $N_s = 8$ and
6.9 when $N_s = 64$.

4.3 Network Characterizations

This section summarizes a few quantifications for various network characteristics
parameterized by network size. Quantifications are provided for full fat-tree, hybrid fat-tree,
and bidelta networks. This allows some comparison across the range of networks between full
fat-tree and full bidelta networks. For each network, characterizations are provided for
required hardware and network latency.

Table 4.3 summarizes most of the variables used in the remainder of this section.



$N_p$              number of processors
$N_c$              number of routing components
$N_{leaf}$         number of processors supported by a leaf stack
$N_{c_{leaf}}$     number of routing components in a bidelta leaf stack
$N_{bidelta}$      number of bidelta leaf clusters
$N_{UT_{64,8}}$    number of $UT_{64,8}$ unit trees
$N_{UT_{64,2}}$    number of $UT_{64,2}$ unit trees
$i$                a parameter describing selection freedom;
                   $i$ represents the number of stack stages in a network;
                   $i$ must be a positive integer
$j$                a parameter describing selection freedom;
                   $j$ represents the number of routing stages in a bidelta stack
$h$                total number of routing stages in a bidelta network
[...]              length of a side of the routing chip
$t_c$              clock period
[...]              length of longest wire
$t_{delay}$        wire delay
$c$                speed of light
$n_{(m,o)}$        stage delays between stage $m$ and $o$
$L_{open}$         latency opening connection from source to destination
$L_{connect}$      latency of an open connection from source to destination

N.B. For the following, latency is used to refer to the period of time between the time
the message enters the network and the time the message arrives at the destination. The
actual latency of a network operation will be a function of this metric and will depend
on the end-to-end protocol used. Connection latency, $L_{connect}$, differs from the latency
opening a connection, $L_{open}$, due to the need to change routing bytes during the opening of
a connection.

Table 4.3: Variable Summary

4.3.1 Full Fat-Tree

Recall from Section 3.3.4 that full fat-tree networks are built entirely from unit trees.
Full fat-trees have processors as the ultimate leaves of the fat-tree structure.

Size

Full fat-tree networks are built with $UT_{64,2}$ unit trees forming the bottommost stage
of the fat-tree and $UT_{64,8}$ unit trees forming the remainder of the tree structure.
Equation 4.14 characterizes the sizes of constructible full fat-trees.

$$N_p = 64^i \qquad (4.14)$$

A full fat-tree network of a given size will have $3i$ tree levels made from $i$ stages of unit
trees. The fat-tree will be composed of $i$ up routing stages and $3i$ down routing stages.

Hardware Requirements

Each $UT_{64,8}$ unit tree requires 208 RN1 routing components (Section 3.3.1) while each
$UT_{64,2}$ requires 52 RN1 components (Section 3.3.4). Equations 4.15 and 4.16 respectively
describe the required number of $UT_{64,2}$ and $UT_{64,8}$ unit trees necessary to build full
fat-trees of size $N_p$ (Section 3.3.4).

N 
Nfvrit = gf (4-15) 

N^ t = (l-4(«))^ (4.16) 

Combining these requirements, we find the total number of routing components required
as expressed in Equation 4.17.

B. = (»,)(£) + (2 08)(l -4<»>) (i) = (i + ^^) nN p (4.17) 

Latency

As described in Section 3.6 there will be some number of stage delays between stack
stages. Let $n_{(m,o)}$ be the number of clock cycles of delay between stack stage $m$ and stage
$o$. Between any pair of stages, the values of $n_{(k,k+1)}$ and $n_{(k+1,k)}$ will be distributed as
described in Section 4.1. Since there is a distribution of possible delays between
the same stack stages, no single value describes this quantity. The reverse stage delays,
$n_{(k+1,k)}$, will be distributed similarly; however, a particular $n_{(k+1,k)}$ need not be related to
the corresponding $n_{(k,k+1)}$ utilized as part of the same interconnection.

To route through a given level of the full fat-tree, a connection must route through one
up routing level for each 3 levels up the tree. The connection will then route through all
the down levels. Between stack stages on both the up and down path, the route will suffer
the additional stage delays due to the long wires between stages. The latency through level
$k$ of the tree is given by Equation 4.18.



$$L_{connect}(k) = \left( \left\lceil \frac{k}{3} \right\rceil + k + \sum_{m=1}^{\lceil k/3 \rceil - 1} \left[ n_{(m,m+1)} + n_{(m+1,m)} \right] \right) \times t_c \qquad (4.18)$$

In opening a connection through level $k$, an additional cycle of delay is incurred between
stacks on the down route in order to swallow the leading routing byte. Similarly, at least
one up routing byte will need to be dropped in crossing over to the down routing path;
every fourth stage between unit trees requires an additional stage of delay for the leading
routing byte to be dropped. The latency to open a connection through level $k$ of the fat-tree
is summarized in Equation 4.19.

The best-case routing occurs when the most locality can be exploited. This occurs when
a processor connects to its nearest neighbor. In this case, the route occurs through level
one of the fat-tree. Thus Equations 4.20 and 4.21 describe the best-case performance for a
full fat-tree.



$$L_{connect} = 2 t_c \qquad (4.20)$$

$$L_{open} = 3 t_c \qquad (4.21)$$


In the worst case, the only common ancestor of the source and destination processors
is the tree root. In this case, the connection must travel all $i$ stages up to the root of the
fat-tree and then all $3i$ stages back down to the destination leaf. Equations 4.22 and 4.23
give the worst-case latencies for the full fat-tree.



J cormect 



J open 
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(4.22) 
(4.23) 



4.3.2 Hybrid Fat-Tree

Hybrid fat-trees are built with bidelta leaf clusters forming the leaves of the fat-tree
structure.

Size

The leaf stage of the network is built with the bidelta leaf clusters described in Section
3.1. These are interconnected with a fat-tree structure constructed from $UT_{64,8}$ unit
trees. Each leaf cluster is built from $j$ routing stages and supports $N_{leaf}$ processors as
described by Equation 4.24.

$$N_{leaf} = 3 \cdot 4^{(j-1)} \qquad (4.24)$$

For technical reasons described in Section 3.1.3, $1 \le j \le 4$. Using $i - 1$ stages of $UT_{64,8}$
unit trees, the total number of processors supported is thus given by Equation 4.25.

$$N_p = N_{leaf} \cdot 64^{(i-1)} = 3 \cdot 4^{(j-1)} \cdot 64^{(i-1)} \qquad (4.25)$$

The fat-tree structure provides $i - 1$ up routing stages and $3(i - 1)$ tree levels and hence
down routing stages.

Hardware Requirements

Each leaf is constructed from $N_{c_{leaf}}$ RN1 routing components as described by Equation
4.26.

$$N_{c_{leaf}} = 4^{(j-1)} + (j-1) \cdot 3 \cdot 4^{(j-2)} = (3j+1) \cdot 4^{(j-2)} \qquad (4.26)$$

The total number of bidelta leaf clusters required is described by Equation 4.27 (Section
3.3.3).

$$N_{bidelta} = \frac{N_p}{N_{leaf}} = 64^{(i-1)} \qquad (4.27)$$

The number of unit trees required for a hybrid fat-tree with $i - 1$ unit tree stages is given
by Equation 4.28 (Section 3.3.3).

$$N_{UT_{64,8}} = \left( \frac{N_p}{576} \right) \left( 1 - 4^{(1-i)} \right) \qquad (4.28)$$

Each $UT_{64,8}$ unit tree is constructed from 208 RN1 components (Section 3.3.1). Consolidating
these relations, we get Equation 4.29, which describes the total number of routing
components in a hybrid fat-tree.

$$N_c = (3j+1) \cdot 4^{(j-2)} \cdot 64^{(i-1)} + 208 \left( \frac{N_p}{576} \right) \left( 1 - 4^{(1-i)} \right)$$

$$= \left[ (3j+1) \cdot 4^{(j-2)} + \frac{13}{12} \cdot 4^{(j-1)} \left( 1 - 4^{(1-i)} \right) \right] 64^{(i-1)} \qquad (4.29)$$
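The hybrid fat-tree sizing relations can be exercised with a small Python sketch. The function and key names are hypothetical, and the formulas are this sketch's reading of the relations above: $3 \cdot 4^{(j-1)}$ processors and $(3j+1) \cdot 4^{(j-2)}$ components per leaf cluster, $64^{(i-1)}$ leaf clusters, $(N_p / 576)(1 - 4^{(1-i)})$ unit trees, and 208 components per unit tree. With those readings the totals agree with the entries of Table 4.6.

```python
def hybrid_fat_tree(i, j):
    """Size and component counts for a hybrid fat-tree.

    i: number of stack stages (i - 1 of them are unit tree stages)
    j: routing stages in each bidelta leaf cluster
    """
    n_leaf = 3 * 4 ** (j - 1)                  # processors per leaf cluster
    n_bidelta = 64 ** (i - 1)                  # number of leaf clusters
    n_p = n_leaf * n_bidelta                   # total processors
    n_c_leaf = (3 * j + 1) * 4 ** j // 16      # (3j+1) * 4^(j-2), kept integral
    # (n_p / 576) * (1 - 4^(1-i)), rearranged to stay in integer arithmetic
    n_ut = n_p * (4 ** (i - 1) - 1) // (576 * 4 ** (i - 1))
    n_c = n_c_leaf * n_bidelta + 208 * n_ut    # total RN1 components
    return {
        "processors": n_p,
        "leaf_clusters": n_bidelta,
        "unit_trees": n_ut,
        "components": n_c,
    }
```

For example, `hybrid_fat_tree(2, 2)` describes a 768-processor network with 64 leaf clusters, one unit tree, and 656 components.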

Latency

The best-case interconnect in the hybrid fat-tree occurs when a connection is made
within the bidelta leaf cluster. When this level of locality can be exploited, the latency for
a connection is given by Equation 4.30.

$$L_{connect} = L_{open} = j \times t_c \qquad (4.30)$$

Opening and continuing a connection are identical in this case because $j$ is constrained to
not exceed four due to current technology limitations.

When communications do need to occur between processors in different leaf clusters,
connections must be made through the fat-tree. As connections are routed between stack
stages, delay cycles will be incurred as described for the full fat-tree (Section 4.3.1). In
routing through some stage of the fat-tree, the connection first requires one cycle to route
into the fat-tree network. The connection traverses one up routing stage every three stages
up the tree it must go. Once the connection reaches the root of the smallest common
sub-tree between the source and destination processor, the connection must then be routed
down all the down routing stages to the desired bidelta leaf stack. Once the connection
finally enters the destination leaf stack, there will be an additional $j$ stages of routing within
the leaf stack. Equation 4.31 summarizes the latency of a connection through level $k$ of the
fat-tree network.

When opening a connection, additional delay cycles are required due to the need to change
routing bytes. The hybrid fat-tree requires the additional cycles for swallowing bytes on
the up routing path, down routing path, and in the crossovers as described for full fat-trees
(Section 4.3.1). Since two bits of routing are required to route into the fat-tree, the
swallow on the up path will occur one stack stage earlier in this configuration than in the
full fat-tree configuration. The hybrid fat-tree will require an additional change of routing
bytes when the connection enters the destination leaf cluster. Equation 4.32 combines these
additional delays to describe the latency for opening a connection through level $k$ of the
fat-tree portion of a hybrid fat-tree.

Worst-case connections in the hybrid fat-tree occur when interconnection must be made
through the top level of the top stage of unit trees. Here the connection is made into the
fat-tree, up the $i - 1$ up routing stages to the root, down the $3(i - 1)$ stages to a leaf cluster,
then across the $j$ stages of the leaf cluster to the destination processor. Equations 4.33 and
4.34 describe the latency for this worst-case interconnection.
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(4.33) 
(4.34) 



4.3.3 Bidelta Network

Section 4.2 described the construction of large bidelta networks. This is a generalization
of the Transit bidelta network to multiple stack sizes.



Size

The constituent stack stages are each constructed with $j$ layers of routing. $i$ stack stages
can then be combined to construct the large bidelta network. The $j$'s for each stage in a
single network need not be identical across all stack stages. Thus the number of processors
in such a network is described by Equation 4.35.

$$N_p = 4^{\sum_{m=1}^{i} j_m} \qquad (4.35)$$

Recall from Section 4.2.1 that size and performance constraints limit $j$ as given by
Relations 4.36 and 4.37.

$$\text{for } m < i: \quad 1 \le j_m \le 3 \qquad (4.36)$$

$$2 \le j_i \le 4 \qquad (4.37)$$

The total number of routing stages in this network is summarized in Equation 4.38.

$$h = j_1 + j_2 + \cdots + j_i = \log_4(N_p) \qquad (4.38)$$

The number of routing components needed can be determined by looking at the entire
network without a need to do individual accounting at each stack stage. Considering the
bandwidth in and out of the beginning and end of the network, Equation 4.39 gives the
number of routing components needed at each routing stage.

$$N_{c_{stage}} = \frac{N_p}{4} \qquad (4.39)$$

Combining this with Equation 4.38, the total number of RN1 routing components required
for the bidelta network will be determined by Equation 4.40.

$$N_c = h \times \frac{N_p}{4} = \frac{N_p}{4} \log_4(N_p) \qquad (4.40)$$
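As a quick check of the component count, the following Python sketch evaluates it for a list of per-stack routing depths. The function name and argument format are hypothetical, and the formulas are this sketch's reading of the relations above: $4^h$ processors for $h$ total routing stages, and $N_p / 4$ components per routing stage. With that reading the results agree with the totals listed in Table 4.8.

```python
def bidelta_components(stages_per_stack):
    """Return (processors, routing stages, RN1 components).

    stages_per_stack lists the routing stages j_m in each stack stage,
    first stack stage to last.
    """
    h = sum(stages_per_stack)      # total routing stages
    n_p = 4 ** h                   # each stage distinguishes 4 directions
    per_stage = n_p // 4           # components in each routing stage
    return n_p, h, h * per_stage
```

For example, `bidelta_components([3, 2])` describes a two-stack network with 1,024 processors, five routing stages, and 1,280 components.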



4.3.4 Latency



Delay stages are incurred between successive stack stages as previously described.
Section 4.2.3 describes the distribution characteristics of the extra delay cycles required between
stack stages. Since the set of source processors is the same as the destination set,
connections must loop back from the end of the bidelta network to the beginning. Since this
requires only translation in the vertical stack dimension, we will assume this loop-back can
occur in a single clock cycle, $t_{stack}$. Note that since routing only occurs in one direction
through the bidelta network, unlike the tree structures, the delay between stack stages is
only incurred once in this structure.

With the exception of the number of delay cycles incurred between stages due to long
wires, all paths through the bidelta network are the same length. Each path traverses
the $h$ routing stages described above. Inter-stage delay is incurred as well as a cycle for
the final loop-back. As such, the connection latency for a bidelta network is described by
Equation 4.41.

$$L_{connect} = \left( h + \sum_{k=1}^{i-1} n_{(k,k+1)} + 1 \right) \times t_{stack} \qquad (4.41)$$
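The connection latency for a multi-stack bidelta network can be sketched as follows. The function name and arguments are hypothetical, the latency is taken as the routing stages plus the inter-stack delay cycles plus one loop-back cycle, and the 10 ns default for the stack clock period is an assumption taken from the clock projection quoted earlier in the chapter.

```python
def bidelta_connect_latency_ns(stages_per_stack, inter_stack_delays, t_stack_ns=10):
    """Connection latency through a multi-stack bidelta network.

    stages_per_stack:   routing stages j_m in each stack stage
    inter_stack_delays: delay cycles between consecutive stack stages
    t_stack_ns:         stack clock period (10 ns is an assumed value)
    """
    if len(inter_stack_delays) != len(stages_per_stack) - 1:
        raise ValueError("need one delay value per stack boundary")
    h = sum(stages_per_stack)                   # total routing stages
    cycles = h + sum(inter_stack_delays) + 1    # +1 for the final loop-back
    return cycles * t_stack_ns
```

With one delay cycle at each stack boundary, two three-stage stacks give `bidelta_connect_latency_ns([3, 3], [1])`, which evaluates to 80 ns.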

In opening a new connection through the bidelta network, additional latency occurs as a
result of changing routing bytes. Routing bytes will be changed between routing stack
stages as described in Section 4.2.2. Due to the constraints placed on $j$, there will never
be a need to change routing bytes within a bidelta stack. Equation 4.42 gives the latency
required to open a connection through the bidelta network.

4.4 Comparisons

In order to offer a clearer comparison between the various networks described herein,
this section provides some concrete examples. Using the equations and assumptions from
Section 4.3 and the geometric configurations described in Sections 3.4 and 4.2, representative
numbers are determined for network size, composition, and latency. All values are
characterized as described in Section 4.3.

4.4.1 Full Fat-Tree

Table 4.4 summarizes the hardware requirements for a few full fat-trees of interesting
sizes. Table 4.5 characterizes the extreme latency cases for the full fat-trees described in
Table 4.4. The worst-case latency included here uses the worst-case wire delays between
unit tree stages on both paths through the fat-tree and assumes the connection must be
made through the tree root. Review Section 4.1 to get a feel for the distribution of latencies
between the best and worst cases.

Table 4.4: Full Fat-Tree Hardware Requirements

4.4.2 Hybrid Fat-Tree

Table 4.6 describes several sizes of hybrid fat-trees. Table 4.7 summarizes the latency
range for these configurations. Just as for the full fat-trees, the worst-case latency shown is
for a connection through the tree root with the longest possible wires between stack stages
on both the up and down tree traversal. Again, Section 4.1 describes the distributions of
wire delays for configurations such as these.

Number of        Latency of Connect          Latency of Open
Processors       Worst Case    Best Case     Worst Case    Best Case
64               40ns          20ns          50ns          30ns
4,096            100ns         20ns          120ns         30ns
262,144          200ns         20ns          230ns         30ns

Table 4.5: Full Fat-Tree Latencies



Number of        Bidelta             Unit Tree              Total
Processors       Stacks              Stacks                 Components
768              64 $B_{12}$         1 $UT_{64,8}$          656
3,072            64 $B_{48}$         4 $UT_{64,8}$          3,392
12,288           64 $B_{192}$        16 $UT_{64,8}$         16,640
49,152           4096 $B_{12}$       80 $UT_{64,8}$         45,312
196,608          4096 $B_{48}$       320 $UT_{64,8}$        230,400
786,432          4096 $B_{192}$      1280 $UT_{64,8}$       1,118,208

Table 4.6: Hybrid Fat-Tree Hardware Requirements



Number of        Latency of Connect          Latency of Open
Processors       Worst Case    Best Case     Worst Case    Best Case
768              90ns          20ns          110ns         20ns
3,072            100ns         30ns          120ns         30ns
12,288           130ns         40ns          150ns         40ns
49,152           170ns         20ns          200ns         20ns
196,608          200ns         30ns          230ns         30ns
786,432          290ns         40ns          320ns         40ns

Table 4.7: Hybrid Fat-Tree Latencies



4.4.3 Full Bidelta



Tables 4.8 and 4.9 summarize the characteristics for some bidelta networks of comparable
size to the full fat-tree and hybrid fat-tree configurations listed above. For this case,
the worst-case latencies are for the case in which the wires are maximal length between
stages. The best-case latencies assume a single cycle of delay due to inter-stage wiring.
Section 4.2.3 describes the delay cycle distributions for this configuration.

Number of        Bidelta                                    Total
Processors       Stacks                                     Components
64               $B_{64,2}$                                 48
256              $B_{256,2}$                                256
1,024            $B_{64,8}$ $B_{16,2}$                      1,280
4,096            $B_{64,8}$ $B_{64,2}$                      6,144
16,384           $B_{64,8}$ $B_{256,2}$                     28,672
65,536           $B_{64,8}$ $B_{64,8}$ $B_{16,2}$           131,072
262,144          $B_{64,8}$ $B_{64,8}$ $B_{64,2}$           589,824
1,048,576        $B_{64,8}$ $B_{64,8}$ $B_{256,2}$          2,621,440

Table 4.8: Bidelta Network Hardware Requirements



Number of        Latency of Connect          Latency of Open
Processors       Worst Case    Best Case     Worst Case    Best Case
64               30ns          30ns          30ns          30ns
256              40ns          40ns          40ns          40ns
1,024            70ns          70ns          80ns          80ns
4,096            90ns          80ns          100ns         90ns
16,384           110ns         90ns          120ns         100ns
65,536           160ns         110ns         180ns         130ns
262,144          210ns         120ns         230ns         140ns
1,048,576        310ns         130ns         330ns         150ns

Table 4.9: Bidelta Network Latencies

4.5 Routing Statistics

Using the simple probabilistic models for routing analysis developed in [Knight 90],
this section provides some basic routing statistics for the network topologies described
here. Statistics are provided for all of the configurations detailed in the previous section.
All statistics summarized here are based on the network model presented in Section 1.2.
This means that each processor will use at most one of its two inputs into the network
at a given point in time. The probabilistic models used to derive these statistics simply
characterize routing probability in terms of network loading. The effect of input queuing
at each source is not modeled.

Traffic distribution will have a considerable effect on the actual performance of any
of these networks. The fat-tree structured networks require considerable communication
locality in order to perform optimally. Each network is considered with the loading
distribution for which it is most favorable. Along with the routing statistics, this section provides
characterizations for the locality required by each network configuration to achieve near
optimal performance.

4.5.1 Full Fat-Tree

A full fat-tree network is optimized to take advantage of locality. The actual construction
and bandwidth allocation determine the locality characteristics necessary to utilize the
network most effectively. The optimal loading case occurs when each connection through an
RN1 routing component is equally likely to be made through each of the component's
outputs. Within a unit tree structure, this means that a connection is equally likely to cross
over at each lateral crossover stage. In a unit tree stage other than the root, the connection
further up the tree will also be utilized with the same likelihood as each lateral crossover.
Taking these two considerations together, the distribution of connections through each tree
level in the same unit tree is roughly equal; the distribution of connections through unit tree
stages drops off by a factor of four at each successive stage toward the root.

There are two ways of looking at the locality distribution. As described, we can think
of the distribution in terms of the probability of routing through a given height in the
fat-tree. This is the probability, $P_{any}(n)$, that a processor will need to communicate with
any processor a fixed distance, $n$, away from the source processor. The probability that the
processor will need to communicate with a particular processor a fixed distance from the
source processor is an alternate way of viewing this locality. This probability,
$P_{particular}(n)$, is simply the previously mentioned probability normalized by the number
of processors that are located the fixed distance away from the source processor. Tables 4.10
through 4.12 summarize the locality distributions assumed for the full fat-trees considered
here in terms of these locality measures.
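
The normalization described above can be sketched in a few lines. The sketch assumes a
radix-4 tree in which 3 * 4^(n-1) processors lie at distance n from the source; this count
is not stated in the text but reproduces the tabulated values. The names
`processors_at_distance` and `p_particular` are illustrative only.

```python
def processors_at_distance(n: int, radix: int = 4) -> int:
    """Processors whose lowest shared tree level with the source is n (assumed radix-4)."""
    return (radix - 1) * radix ** (n - 1)


def p_particular(p_any: float, n: int) -> float:
    """Normalize P_any(n) by the number of processors at distance n."""
    return p_any / processors_at_distance(n)


# P_any(n) values for the 4096-processor full fat-tree, taken from Table 4.11.
P_ANY_4096 = {1: 0.278, 2: 0.278, 3: 0.278, 4: 5.53e-2, 5: 5.53e-2, 6: 5.53e-2}

if __name__ == "__main__":
    for n, p_any in P_ANY_4096.items():
        print(f"n={n}  P_any={p_any:.3g}  P_particular={p_particular(p_any, n):.3g}")
```

Because the number of candidate destinations grows by a factor of four per level while
the probability of reaching a level is at best constant, the chance of talking to any one
distant processor falls off quickly with distance.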

With these locality models, Figures 4.8 through 4.10 show the routing statistics on each
of these full fat-trees. Each graph includes a separate curve for the routing statistic through
each level of the fat-tree. The topmost curve characterizes the statistics for a connection
through the first level of the fat-tree. Successive curves down the graph give routing
statistics for routing through successive tree levels; the bottommost curve characterizes
routing through the tree root.

Figure 4.11 shows the normalized routing probabilities for these three sizes of full
fat-trees. The normalized statistics are roughly identical for the three sizes, so the individual
normalized curves are indistinguishable. The normalized probabilities give an idea of overall
network performance under the assumed loading conditions. The normalized probabilities
are determined by weighting the probability of routing through a given tree level by the
probability that a connection can successfully be made through that tree level, as
summarized in Equation 4.43.

P_{norm} = \sum_{n=1}^{\max(n)} P_{any}(n) \cdot P_{route}(n)        (4.43)

Figure 4.8: Routing Statistics for Full Fat-Tree (64 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.9: Routing Statistics for Full Fat-Tree (4096 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.10: Routing Statistics for Full Fat-Tree (262144 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.11: Normalized Routing Statistics for Full Fat-Trees
[horizontal axis: Probability that a Source is Transmitting]
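
The weighting behind the normalized curves (Equation 4.43) amounts to a probability-weighted
sum over tree levels. The sketch below is illustrative: the P_any(n) values are those
tabulated for the 64-processor full fat-tree, but the per-level success probabilities in
`P_ROUTE_EXAMPLE` are hypothetical placeholders, not values read from the figures.

```python
def normalized_routing_probability(p_any: dict, p_route: dict) -> float:
    """Sum over levels n of P_any(n) * P_route(n), in the manner of Equation 4.43."""
    if set(p_any) != set(p_route):
        raise ValueError("both distributions must cover the same tree levels")
    return sum(p_any[n] * p_route[n] for n in p_any)


# Locality distribution of the 64-processor full fat-tree (Table 4.10).
P_ANY_64 = {1: 0.333, 2: 0.333, 3: 0.333}

# Hypothetical per-level routing success probabilities at one loading point.
P_ROUTE_EXAMPLE = {1: 0.95, 2: 0.90, 3: 0.85}

if __name__ == "__main__":
    p_norm = normalized_routing_probability(P_ANY_64, P_ROUTE_EXAMPLE)
    print(f"normalized routing probability: {p_norm:.4f}")
```

Evaluating this sum at each loading point on the horizontal axis would trace out one
normalized curve.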

n       P_any(n)        P_particular(n)

1       0.333           0.111
2       0.333           2.78 x 10^-2
3       0.333           6.94 x 10^-3

Table 4.10: Locality Structure of Full Fat-Trees (64 Processors)



n       P_any(n)        P_particular(n)

1       0.278           9.27 x 10^-2
2       0.278           2.32 x 10^-2
3       0.278           5.79 x 10^-3
4       5.53 x 10^-2    2.88 x 10^-4
5       5.53 x 10^-2    7.20 x 10^-5
6       5.53 x 10^-2    1.80 x 10^-5

Table 4.11: Locality Structure of Full Fat-Trees (4096 Processors)



n       P_any(n)        P_particular(n)

1       0.274           9.13 x 10^-2
2       0.274           2.28 x 10^-2
3       0.274           5.71 x 10^-3
4       3.17 x 10^-2    1.65 x 10^-4
5       3.17 x 10^-2    4.13 x 10^-5
6       3.17 x 10^-2    1.03 x 10^-5
7       1.06 x 10^-2    8.62 x 10^-7
8       1.06 x 10^-2    2.16 x 10^-7
9       1.06 x 10^-2    5.39 x 10^-8

Table 4.12: Locality Structure of Full Fat-Trees (262144 Processors)

4.5.2 Hybrid Fat-Tree

The hybrid fat-tree also takes advantage of locality much like the full fat-tree. Since
only one-quarter of the bandwidth out of the first stage connects to the fat-tree structure,
optimal performance occurs when three-quarters of the processor initiated traffic is local to
the bidelta leaf clusters. Within the leaf, the distribution of traffic is uniform; that is, it is
equally likely to need to connect to any of the processors within the leaf. For connections
through the tree, the arrangement is identical to that of a full fat-tree so the distribution
of requests by tree level will progress in the same manner. Tables 4.13 and 4.14 summarize
the distributions assumed for the hybrid fat-tree network traffic.

                         P_particular(n), by bidelta leaf cluster size
n        P_any(n)        12 processors   48 processors   192 processors

(leaf)   0.75            6.82 x 10^-2    1.60 x 10^-2    3.93 x 10^-3
1        8.33 x 10^-2    2.31 x 10^-3    5.78 x 10^-4    1.45 x 10^-4
2        8.33 x 10^-2    5.78 x 10^-4    1.45 x 10^-4    3.62 x 10^-5
3        8.33 x 10^-2    1.45 x 10^-4    3.62 x 10^-5    9.04 x 10^-6

Table 4.13: Locality Structure of Hybrid Fat-Trees (Single Unit Tree Stage)



                         P_particular(n), by bidelta leaf cluster size
n        P_any(n)        12 processors   48 processors   192 processors

(leaf)   0.75            6.82 x 10^-2    1.60 x 10^-2    3.93 x 10^-3
1        6.95 x 10^-2    1.93 x 10^-3    4.83 x 10^-4    1.21 x 10^-4
2        6.95 x 10^-2    4.83 x 10^-4    1.21 x 10^-4    3.02 x 10^-5
3        6.95 x 10^-2    1.21 x 10^-4    3.02 x 10^-5    7.54 x 10^-6
4        6.95 x 10^-2    3.02 x 10^-5    7.54 x 10^-6    1.89 x 10^-6
5        6.95 x 10^-2    7.54 x 10^-6    1.89 x 10^-6    4.71 x 10^-7
6        6.95 x 10^-2    1.89 x 10^-6    4.71 x 10^-7    1.18 x 10^-7

Table 4.14: Locality Structure of Hybrid Fat-Trees (Two Unit Tree Stages)
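
The per-processor figures in Tables 4.13 and 4.14 can be reproduced with a small
extension of the full fat-tree normalization. The sketch assumes bidelta leaf clusters of
12, 48, or 192 processors (sizes inferred from the tabulated values), treats level 0 as
traffic that stays inside the source's own leaf, and assumes that 3 * 4^(n-1) leaf clusters
lie at tree distance n. The function name is illustrative.

```python
def hybrid_p_particular(p_any: float, level: int, leaf_size: int, radix: int = 4) -> float:
    """Probability of addressing one particular processor in a hybrid fat-tree.

    level 0 means the destination is in the source's own bidelta leaf cluster;
    level n >= 1 means the connection crosses over at the n-th tree level.
    """
    if level == 0:
        # Every other processor in the leaf is equally likely.
        return p_any / (leaf_size - 1)
    leaves_at_distance = (radix - 1) * radix ** (level - 1)
    return p_any / (leaves_at_distance * leaf_size)


if __name__ == "__main__":
    # Single unit tree stage (Table 4.13): P_any is 0.75 in the leaf, 8.33e-2 per tree level.
    for leaf_size in (12, 48, 192):
        row = [hybrid_p_particular(0.75, 0, leaf_size)]
        row += [hybrid_p_particular(8.33e-2, n, leaf_size) for n in (1, 2, 3)]
        print(leaf_size, " ".join(f"{p:.3g}" for p in row))
```

The same function applied with the P_any(n) values of Table 4.14 covers the
configurations with two unit tree stages.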

Figures 4.12 through 4.17 show the routing statistics for the hybrid fat-trees described
in Section 4.4.2. The first three graphs are for the configurations with one stage of unit
trees, while the latter three graphs are for the two tree stage configurations. The topmost
curve in each graph represents the routing probabilities within the bidelta cluster. Each
successive curve down a graph shows routing statistics for connecting through a successively
higher level in the tree structure.

Figure 4.18 shows the normalized routing probabilities for these six hybrid fat-tree
configurations. The normalized statistics for hybrid fat-trees with one or two unit tree
stages coincide as long as the size of their bidelta leaf clusters is identical; that is, the
normalized statistics differ only by the size of the bidelta leaf cluster. The topmost curve
shows statistics for hybrid fat-trees with 12-processor bidelta leaves, the middle for
48-processor bidelta leaves, and the bottom for 192-processor leaves.

4.5.3 Full Bidelta

To a processor on a full bidelta network, all destinations are equally distant. The
routing to all destinations is topologically identical. The bidelta network will provide its
best overall performance when message traffic is uniformly distributed across all processors;
that is, its most favorable loading condition occurs when each source will open a connection
to all destinations with equal likelihood. This flat, random distribution of connections is
essentially the extreme case in which no locality is exploited. As traffic deviates from
this flat distribution, the performance will deteriorate. Figure 4.19 shows the routing
statistics for the bidelta networks described in Section 4.4.3. The topmost curve in the
figure corresponds to a 3 stage, 64 processor, bidelta network. Each successive curve down
the graph plots statistics for a network with an additional stage of routing, making the
total network size a factor of four larger. The bottommost curve thus corresponds to a 10
stage network supporting 1,048,576 processors.

Figure 4.12: Routing Statistics for Hybrid Fat-Tree (768 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.13: Routing Statistics for Hybrid Fat-Tree (3072 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.14: Routing Statistics for Hybrid Fat-Tree (12288 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.15: Routing Statistics for Hybrid Fat-Tree (49152 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.16: Routing Statistics for Hybrid Fat-Tree (196608 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.17: Routing Statistics for Hybrid Fat-Tree (786432 processors)
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.18: Normalized Routing Statistics for Hybrid Fat-Tree
[horizontal axis: Probability that a Source is Transmitting]

Figure 4.19: Routing Statistics for Full Bidelta Networks of Various Sizes
[horizontal axis: Probability that a Source is Transmitting]
5. Conclusion



Chapters 2 and 3 presented the topology and construction details for constructing
fat-tree style networks using Transit technology. Chapter 4 then summarized many of the
parameters and characteristics implied by the network structure in order to provide a basis
for comparison with other architectures. This chapter summarizes some of the requirements
and consequences of these fat-tree structures and identifies components of the design which
might merit further study.

5.1 Routing Component Requirements

The manner in which the RN1 routing component could be utilized as the primitive
routing element in network construction was always a primary consideration throughout
this work.

In order to build a maximally fault tolerant network structure, two properties of the
RN1 component were identified as critical. The ability to separately configure the byte
dropping characteristics of each input port was established as necessary for optimal
dispersion of connections (Sections 3.1.2 and 3.3.5). The alternate configuration of RN1 as an
independent pair of 4x4 crossbars with one output in each logical direction proved critical
to the construction of a network resilient against single component failures. This also helps
minimize the effects of each component failure (Sections 2.1.2 and 3.3.4).

For this network structure the only functionality that is lacking from the current RN1
design is the ability to deal with round trip delays on long wires (Section 3.6). This ability
turns out to be necessary for optimal performance when building any networks of these sizes
regardless of whether the network is arranged as a bidelta or fat-tree network (Sections 3.6
and 4.2). Section 3.6 describes a scheme for rectifying this deficiency in future revisions of
RN1.

5.2 Characteristics

The network design presented overcomes the potential problems identified in Section 1.1
due to network size, decomposition, and topology. A number of interesting structures were
developed to surmount these problems. The result is a fat-tree network structure that has
a number of desirable properties.

5.2.1 Constructable

An overriding concern in the development of the physical network structure was
establishing a design which could be physically realized in the real world. Attention was paid
to the real world constraints posed by technology limitations. Also, some care was taken
to minimize the construction complexity. The intent was to design a structure which could
realistically be fabricated within the next few years without relying on any technological
breakthroughs. At the same time, attention was paid to technology trends so that the
architecture itself would still prove valuable with improved technological capabilities.

The unit tree structures (Section 3.3) provide a powerful component for dealing with
technological size limitations. They provide a decomposition of the large network into
components that can be reliably fabricated. The unit tree also serves as a building block,
limiting design complexity. Since each unit tree is virtually identical, the complexity of
configuring a network is reduced to that of configuring a couple of unit trees and then
interconnecting unit trees accordingly.

The hollow cube geometry (Section 3.4.3) provides a simple and straightforward manner
in which to arrange unit trees. Its structure reflects the natural growth characteristics of
the fat-trees constructed with unit trees. With careful design (Section 3.5), the hollow cube
structures can provide a framework for the interconnection of unit trees; in this manner,
they will significantly simplify the complexity of configuring and maintaining unit tree
interconnections.

5.2.2 Fault Tolerance

Utilizing the properties of RN1 and judicious wiring patterns, the fat-tree network will
be reasonably fault tolerant. All the fat-tree structures described are resilient against single
component failures (Sections 2.1.2 and 3.3.4). With the redundant paths through the
network, the effects of any component failures are minimized. Sections 2.1.1 and 2.3 presented
constraints on wiring to maximize the fault tolerance of these networks. Additionally, the
accessibility afforded by the hollow cube structure will allow faults to be repaired while the
network is in operation (Section 3.5.4).

5.2.3 Cheap Routing

The network is structured to make routing on the fat-tree both conceptually and
practically simple. This allows routing to be performed cheaply as described in Section 3.8.

5.2.4 Performance

The resulting fat-tree based networks attempt to minimize the latency of the
interconnect while maximizing the probability of performing a successful route.

Latency is improved over a naive approach in a number of ways. Routing on the upward
tree path (Section 2.2.4) reduces by a factor of three the number of stages of routing incurred
while traveling up the tree. Utilizing the RN1 component, which is a constant sized switch,
keeps the routing delay at each stage constant at the cost of incomplete concentration
(Section 2.2.4). Making the routing clock cycle insensitive to the signal delays incurred
while crossing long wires allows fast pipelining of data transmissions; a long wire only
affects the latency of a connection which actually traverses it (Section 3.6).

When building networks of the magnitude described here, it is clear that locality will
be necessary to obtain reasonable performance. The fat-trees constructed from unit trees
have a natural architectural locality resulting from the allocation of hardware. This
natural locality is summarized for full fat-trees in Section 4.5.1 and for hybrid fat-trees in
Section 4.5.2. Knowing this optimal level of locality may prove useful in designing parallel
algorithms and software for large systems. With proper locality, the routing statistics for
these fat-tree structures are reasonable; the probability of obtaining a successful route in a
fully loaded network is in the 70% to 80% range even for networks with three-quarters of a
million processors (Sections 4.5.1 and 4.5.2).

5.3 Future

I have attempted to develop the construction of these fat-tree structures in reasonable
detail. Sufficient detail was provided to demonstrate the feasibility of such networks. Also,
this development gives enough information about the structure that good estimates of
critical parameters such as hardware requirements and performance can be obtained. This
development, however, is by no means definitive. A number of issues are open for further
study and optimization while a few issues require additional specification.

5.3.1 Routing Statistic Modeling

The routing statistics provided in Section 4.5 model the probability of successfully
finding a route through the network as a function of network loading. As such, they do
not take into account the manner in which the network will normally be used; in general,
when a message fails to get routed, the processor will resend it later. This will have a
feedback effect on the network loading. In order to see network performance under this
more realistic model of network traffic, a more detailed statistical model must be utilized
which takes input queuing into account. Additionally, this analysis would provide a means
for estimating the amount of time required, on average, in order to acquire a complete
connection through the network.

5.3.2 Simulations

As a complement to statistical modeling, it will be enlightening to simulate this
network structure. This will provide a good means of checking the validity of the statistical
assumptions under various loading conditions.

5.3.3 Interconnection Details

Section 2.3 provided a number of constraints necessary to obtain good performance.
As mentioned there, it is unclear whether or not the expansion properties proposed by
Leighton and Maggs should be used to further dictate the details of interconnection wiring.
Once we have functional simulations, we should be able to establish the importance of these
constraints. Once this is known, detailed wiring patterns must be developed for the unit
tree structures.

The wiring between stack stages merits additional attention and detail. If optical
interconnection is to actually be used, additional study on the integration of this technology will
be required. Additional work on schemes for adaptive alignment will be a virtual necessity
in order to utilize free-space interconnection on this scale. For maximal efficiency, custom
components may need to be designed and fabricated for this purpose.

5.3.4 Geometry

As stated in Section 3.4.6, the hollow cube geometry is not known to be optimal. It is
possible that future study may produce a geometry that provides shorter interconnection
distances for the basic network structure described while retaining the properties of
maintainability and constructibility.

5.3.5 Packet Switching

The basic Transit networking scheme is circuit switched. Circuit switching is used to
minimize the latency inherent in getting a response from a remote processor on the network.
Using circuit switching, no buffering is needed within the network. This avoids problems
due to internal buffer overflow or network congestion by blocked packets.

For large networks where the latency from one end of the network to the other is
greater than the time required to transmit the standard quanta of data, circuit switching
may inefficiently utilize network bandwidth. If such is the case, it might be worthwhile to
consider packet routing schemes. The basic fat-tree structure and interconnect described
here would be applicable to a packet switched network scheme. The differences that arise
would occur in the routing protocol and policies. Almost all of these differences would thus
be limited to the design of the cache-controller and routing component.
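
To give a rough feel for when circuit switching starts to waste bandwidth, the sketch
below compares the time a circuit spends carrying data with the time it is held while the
connection is being set up. The model (efficiency = transfer time / (transfer time + setup
latency)) and the 100 ns transfer time are assumptions made for illustration; only the
worst-case connect latencies are taken from Table 4.9.

```python
def circuit_efficiency(transfer_ns: float, setup_latency_ns: float) -> float:
    """Fraction of the circuit holding time spent moving data (simplified model)."""
    if transfer_ns <= 0 or setup_latency_ns < 0:
        raise ValueError("transfer time must be positive and latency non-negative")
    return transfer_ns / (transfer_ns + setup_latency_ns)


# Worst-case connect latencies (ns) for selected bidelta network sizes, from Table 4.9.
WORST_CASE_CONNECT_NS = {64: 30, 1024: 70, 65536: 160, 1048576: 310}

# Hypothetical time (ns) to transmit one standard unit of data once connected.
TRANSFER_NS = 100

if __name__ == "__main__":
    for processors, latency in WORST_CASE_CONNECT_NS.items():
        eff = circuit_efficiency(TRANSFER_NS, latency)
        print(f"{processors:>9,} processors: {eff:.0%} of holding time carries data")
```

Under these assumptions the useful fraction falls as the network grows, which is the
regime in which a packet routing scheme might be worth considering.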

5.3.6 Construct Prototypes

Certainly, the most definitive way to evaluate the worth of this fat-tree network
structure is to actually construct prototypes. The construction exercise will guarantee that no
essential construction details are overlooked. Such construction will, no doubt, uncover
issues and problems not yet considered.